Abstract 

The performance of Orthogonal Matching Pursuit (OMP) for variable selection is analyzed for random 
designs. When contrasted with the deterministic case, since the performance is here measured after 
averaging over the distribution of the design matrix, one can have far less stringent sparsity constraints 
on the coefficient vector. We demonstrate that for exact sparse vectors, the performance of the OMP is 
similar to known results on the Lasso algorithm [IEEE Trans. Inform. Theory 55 (2009) 2183-2202]. 
Moreover, variable selection under a more relaxed sparsity assumption on the coefficient vector, whereby 
one has only control on the $\ell_1$ norm of the smaller coefficients, is also analyzed. As a consequence of 
these results, we also show that the coefficient estimate satisfies strong oracle type inequalities. 



1 Introduction 

Consider the linear regression model

$$Y = X\beta + \epsilon \tag{1}$$

where $X \in \mathbb{R}^{n \times p}$, the coefficient vector $\beta \in \mathbb{R}^{p}$ and the noise $\epsilon \in \mathbb{R}^{n}$. The high dimensional case, where p 
is of the same order as, or possibly much larger than, n, has attracted immense interest. In many 
applications, interest is not primarily in prediction of the response Y, but in the accuracy of estimation of 
the coefficient $\beta$. Examples of such applications include micro-array data analysis, graphical model selection 
[19], compressed sensing [10], [5], and communications [3], [2], [H]. As is well known, in the high dimensional 
setting, $\beta$ is unidentifiable unless the design matrix X is well-structured and there is some sparsity constraint 
on the coefficient vector $\beta$. This sparsity assumption corresponds to restricting $\beta$ to few non-zero entries 
($\ell_0$-sparsity), or more generally, assuming that $\beta$ has only a few terms that are large in magnitude. 

The Orthogonal Matching Pursuit is a variant of the Matching Pursuit algorithm [18], where successive 
fits are computed through the least squares projection of Y on the current set of selected terms. For 
deterministic X matrices, variable selection properties of this algorithm, for $\ell_0$-sparse vectors, have been 
analyzed for the noisy case in Zhang [35] and Cai and Wang [5]. However, as we shall review in Subsection 1.2, 
although they give strong performance guarantees under certain conditions on the X matrix, they impose 
severe constraints on the sparsity of $\beta$. Similar results have been shown for the Lasso, for example in Zhao and 
Yu [30]. With random designs one can have reliable detection of the support with far less stringent sparsity 
constraints; the performance is here measured after averaging over the distribution of X. For example, 
Wainwright [26] proved such results for the Lasso algorithm. The main results of this paper, apart from 
showing that similar properties hold for the OMP, demonstrate two important additional properties. Firstly, 
we give results on partial support recovery, which is important since exact recovery of support places strong 
requirements on n if some of the non-zero elements are small in magnitude. Secondly, and more importantly, 
we relax the assumption that $\beta$ is $\ell_0$-sparse and address variable selection under a more general notion of 
sparsity, whereby one has only control on the $\ell_1$ norm of the smaller elements of $\beta$. We demonstrate that 
even under this more relaxed assumption, one can reliably estimate the position of the larger entries using the 
OMP. This has certain parallels with recent work on the Lasso by Zhang and Huang [27]. As a consequence 
of these results, we show that our coefficient estimate, after running the algorithm, satisfies strong oracle 
inequalities, similar to those demonstrated for the Lasso [29] and the Dantzig selector [6]. 

The paper is organized as follows. Below, we describe the OMP algorithm. The stopping criterion we use 
is slightly different from what is traditionally used in the literature. Subsection 1.2 motivates in greater detail 
our interest in random designs. In Subsection 2.1 we give results for design matrices that have i.i.d. 
sub-Gaussian entries and $\ell_0$-sparse vectors. This extends the results in Tropp and Gilbert [25] to the noisy case. 
In Subsection 2.2 we describe more general results with correlated Gaussian designs, where we only have 
control over the $\ell_1$ norm of the smaller coefficients. Sections 3, 4 and 5 give proofs of our main results. The 
appendices contain auxiliary results. 



1.1 The Orthogonal Matching Pursuit algorithm 

Denote as $J = J_1 = \{1, 2, \ldots, p\}$ the set of indices corresponding to columns of the X matrix. For 
each step i, with $i \ge 1$, a single index a(i) is detected to be non-zero in that step. Accordingly, denoting 
$d(i) = a(1) \cup a(2) \cup \cdots \cup a(i)$ as the set of detected columns after i steps, step i + 1 of the algorithm only 
operates on the columns in $J_{i+1} = J - d(i)$, that is, the columns not detected in the previous steps. In other 
words, indices detected in previous steps remain detected. 

The decision on whether a particular index j is detected during a particular step i is based on the absolute 
value of a statistic $Z_{ij}$. Here, $Z_{ij}$ is simply the inner product between $X_j$ and the normalized residual 
$R_{i-1}/\|R_{i-1}\|$, where $R_{i-1}$ is the residual computed in the previous step. 

Apart from the response vector Y and design matrix X, the other input to the algorithm is a positive 
threshold value $\tau$. Denote $\|\cdot\|$ as the Euclidean norm. We now describe the OMP algorithm. 

• Initialize $R_0 = Y$, $d(0) = \emptyset$. Start with step i = 1. 

• Update 

$$Z_{ij} = \frac{X_j^T R_{i-1}}{\|R_{i-1}\|}, \quad j \in J_i.$$

• If $\max_{j \in J_i} |Z_{ij}| > \tau$, do the following: 

- Assign $a(i) = \arg\max\{|Z_{ij}| : j \in J_i\}$. 

- Set $d(i) = d(i-1) \cup a(i)$. Update $R_i = (I - \mathcal{P}_i)Y$, where $\mathcal{P}_i$ is the projection matrix for the 
column space of $X_{d(i)}$, and set $J_{i+1} = J_i - a(i)$. 

- Increase i by one and return to the Update step. 

• Stop if $\max_{j \in J_i} |Z_{ij}| \le \tau$. 
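
The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the implementation analyzed in the paper: the function name `omp_threshold`, the use of `numpy.linalg.lstsq` to form the projection residual, and the relative tolerance used to detect a perfect fit are all choices made here for the example.

```python
import numpy as np

def omp_threshold(X, Y, tau):
    """Greedy selection that stops when no normalized statistic exceeds tau."""
    n, p = X.shape
    detected = []                          # d(i): indices selected so far
    residual = np.asarray(Y, dtype=float)  # R_0 = Y
    y_norm = np.linalg.norm(residual)
    while len(detected) < min(n, p):
        r_norm = np.linalg.norm(residual)
        # Hypothetical safeguard: treat a numerically perfect fit as a stop.
        if r_norm <= 1e-12 * y_norm:
            break
        z = X.T @ residual / r_norm        # statistics Z_ij for all columns j
        if detected:
            z[detected] = 0.0              # detected columns are orthogonal to R_{i-1}
        j = int(np.argmax(np.abs(z)))
        if abs(z[j]) <= tau:               # stopping rule: max_j |Z_ij| <= tau
            break
        detected.append(j)
        # R_i = (I - P_i) Y, computed through a least squares fit on X_{d(i)}
        coef, *_ = np.linalg.lstsq(X[:, detected], Y, rcond=None)
        residual = Y - X[:, detected] @ coef
    return sorted(detected)

# Small deterministic example with orthogonal columns and no noise.
X = 2.0 * np.eye(4)
beta = np.array([3.0, 0.0, -2.0, 0.0])
print(omp_threshold(X, X @ beta, tau=0.5))
```

Recomputing the least squares fit from scratch at every step keeps the sketch short; an implementation meant for large p would typically update a QR factorization instead.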

We remark that for any step i, the inner product $X_j^T R_{i-1}$, for $j \in d(i-1)$, is 0. Correspondingly, since 
$Z_{ij} = 0$ for $j \in d(i-1)$, the maximum of $|Z_{ij}|$ over $j \in J_i$ is the same as the maximum over all $j \in J$. Also, 
the newly selected term a(i) may be equivalently expressed as 

$$a(i) = \arg\min_{j} \inf_{w} \|Y - \mathrm{Fit}_{i-1} - wX_j\|^2,$$

where $\mathrm{Fit}_{i-1}$ is the least squares fit of Y on the columns in d(i − 1). In this respect, the OMP is similar to 
other greedy algorithms such as relaxed greedy and forward-stepwise algorithms ([4], [15], [16], [E]), that 
operate through successive reduction in the approximation error. 
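
The equivalence between the two selection rules can be made explicit with a short calculation. The derivation below is a sketch that assumes the columns of X have been scaled to a common norm; the symbols $R$ (for the residual $Y - \mathrm{Fit}_{i-1}$) and $w^*$ are introduced here only for the illustration.

```latex
% For a fixed column X_j, minimize over the scalar w:
\[
  \frac{d}{dw}\,\|R - wX_j\|^2 = -2X_j^T R + 2w\|X_j\|^2 = 0
  \quad\Longrightarrow\quad
  w^* = \frac{X_j^T R}{\|X_j\|^2}.
\]
% Substituting w^* back gives the smallest achievable error for column j:
\[
  \inf_{w}\|R - wX_j\|^2 = \|R\|^2 - \frac{(X_j^T R)^2}{\|X_j\|^2}.
\]
% Hence minimizing over j is the same as maximizing |X_j^T R| / \|X_j\|,
% which coincides with maximizing |X_j^T R| / \|R\| when all \|X_j\| are equal.
```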

As mentioned earlier, the stopping criterion considered here is slightly different from that considered in the 
literature. Traditionally, for the no-noise setting, the algorithm is run until there is a perfect fit between Y 
and the selected terms, that is, $R_i = 0$ (see for example [23], [2S]). In the noisy case, as analyzed here, 
there are two standard approaches. The first, as done in [5], [2H], is to stop when $\max_{j \in J} |X_j^T R_{i-1}|$ is less 
than some fixed threshold. The second approach, as analyzed in [12], [5], is to stop when $\|R_i\|$ is less than 
some pre-specified value. 

Our stopping criterion, which is more similar to the first approach, is equivalent to continuing the algorithm 
until $\max_{j \in J} |X_j^T R_{i-1}| \le \tau\|R_{i-1}\|$. The motivation for the use of such a statistic comes from the analysis 
of a similar iterative algorithm in Barron and Joseph [5] for a communications setting. However, there the 
values of the non-zero $\beta_j$'s were known in advance; this added information played an important role in the 
analysis of the algorithm. A similar statistic was used by Fletcher and Rangan [Tl] for an asymptotic analysis 
of the OMP for exact support recovery using i.i.d. designs. 

Notation: Let a = a(n, p, k), b = b(n, p, k) be two positive functions of n, p and k. We write $a = O(b)$ 
if $a \le c_1 b$ for some positive constant $c_1$ that is independent of n, p or k. Similarly, $a = \Omega(b)$ means 
$a \ge c_2 b$ for a positive constant $c_2$ independent of n, p or k. 

1.2 Related work 

As mentioned earlier, we are interested in variable selection in the high dimensional setting. Apart from 
iterative schemes, another popular approach is the convex relaxation scheme Lasso [22]. In order to motivate 
our interest in random design matrices, we describe existing results on variable selection, using both methods, 
with deterministic as well as random design matrices. For convenience, we concentrate on implications of 
these results assuming the simplest sparsity constraint on $\beta$, namely that $\beta$ has only a few non-zero entries. 

In particular, we assume that 

$$|S_0(\beta)| = k, \quad \text{where } S_0(\beta) = \{j : \beta_j \neq 0\}. \tag{2}$$

In other words, attention is restricted to all k-sparse vectors, that is, those that have exactly k non-zero 
entries. For convenience, we drop the dependence on $\beta$ and denote $S_0(\beta)$ as $S_0$ whenever there is no 
ambiguity. The simplest goal then is to recover $S_0$ exactly, under the additional assumption that all $\beta_j$, for 
$j \in S_0$, have magnitude at least $\beta_{min}$, where $\beta_{min} > 0$. Denote as $C = C(\beta_{min}, k)$ the set of coefficient 
vectors satisfying this assumption. 

Further, denote $\hat{S}$ as the estimate of $S_0$ obtained using either method, and $\mathcal{E} = \{\hat{S} \neq S_0\}$ the error event 
that one is not able to recover the support exactly. For deterministic X, interest is mainly in conditions on 
X so that 

$$P_{err,X} = \sup_{\beta \in C} P_\beta(\mathcal{E}|X) \tag{3}$$

can be made arbitrarily small when n, p, or k become large. Here $P_\beta(\cdot|X)$ denotes the distribution of Y for 
the given X and $\beta$. 

A common sufficient condition on X for this type of recovery is the mutual incoherence condition, which 
requires that the inner product between distinct columns be small. In particular, letting $\|X_j\|/\sqrt{n} = 1$ 
for all $j \in J$, it is assumed that 

$$\gamma(X) = \frac{1}{n} \max_{j \neq j'} |X_j^T X_{j'}| \tag{4}$$

is $O(1/k)$. Another related criterion is the irrepresentable criterion [23], [30], which assumes, for all subsets 
T of size k, that 

$$\|(X_T^T X_T)^{-1} X_T^T X_j\|_1 < 1, \quad \text{for all } j \in J - T. \tag{5}$$

Here $\|\cdot\|_1$ denotes the $\ell_1$ norm. 

Observe that if $P_{err,X}$ is small, it gives strong guarantees on support recovery, since it ensures that 
any $\beta$ with $|S_0(\beta)| = k$ can be recovered with high probability. However, it imposes severe constraints 
on the X matrix. As an example, when the entries of X are i.i.d. Gaussian, the coherence $\gamma(X)$ is around 
$\sqrt{2\log p/n}$. Correspondingly, for (4) to hold, n needs to be $\Omega(k^2 \log p)$. In other words, the sparsity k 
should be $O(\sqrt{n/\log p})$, which is rather strong since ideally one would like k to be of the same order as n. 
Similar requirements are needed for the irrepresentable condition to hold. Recovery using the irrepresentable 
condition has been shown for the Lasso in [30], [26], and for the OMP in [28], [5]. Indeed, it has been observed, 
in [30] for the Lasso and in [28] for the OMP, that a similar condition is also necessary if one wants 
exact recovery of the support while keeping $P_{err,X}$ small. 
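
To get a feel for how restrictive the coherence requirement is, the coherence of a simulated design can be computed directly. The sketch below assumes the definition of coherence as the largest normalized inner product between distinct columns after rescaling each column to squared norm n; the function name `mutual_coherence` and the simulation sizes are illustrative choices.

```python
import numpy as np

def mutual_coherence(X):
    """Largest |X_j^T X_j'| / n over distinct columns, after column rescaling."""
    n = X.shape[0]
    # Rescale so that every column satisfies ||X_j||^2 / n = 1.
    Xs = X * (np.sqrt(n) / np.linalg.norm(X, axis=0))
    gram = np.abs(Xs.T @ Xs) / n
    np.fill_diagonal(gram, 0.0)   # ignore the j = j' terms
    return float(gram.max())

# Compare a simulated Gaussian design with the sqrt(2 log p / n) scale.
rng = np.random.default_rng(0)
n, p = 400, 200
X = rng.standard_normal((n, p))
print(mutual_coherence(X), np.sqrt(2 * np.log(p) / n))
```

Since the coherence must be O(1/k) for the deterministic guarantees, a value of this size already limits k to the order of $\sqrt{n/\log p}$.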

A natural question is to ask about requirements on X to ensure recovery in an average sense, as opposed to 
the strong sense described above. One way to proceed, as done here, is to consider random X matrices 
and ask about the requirements on n, p, k, as well as $\beta_{min}$, so that 

$$P_{err} = \sup_{\beta \in C} P_\beta(\mathcal{E}) \tag{6}$$

is small. Here $P_\beta(\mathcal{E}) = E_X P_\beta(\mathcal{E}|X)$, where the expectation on the right is over the distribution of X. 
For the Lasso, Wainwright [26] considers random X matrices, with rows drawn i.i.d. $N_p(0, \Sigma)$. It is shown 
that under certain conditions on $\Sigma$, which can be described as population counterparts of the conditions 
for deterministic X's, one can recover $S_0$ with high probability with $n = \Omega(k \log p)$ observations, with the 
constant depending inversely on $\beta_{min}^2$. The form of n is in a sense ideal since now $k = O(n/\log p)$ is nearly 
of the same order as n, if we ignore the log p factor. As mentioned earlier, apart from establishing that similar properties 
hold for the OMP with k-sparse vectors, we also demonstrate strong support recovery results under a more 
general notion of sparsity. These results are described in the next section. 

We also note that instead of averaging over X, one could assume a distribution on $\beta$ and analyze the average 
probability of $\mathcal{E}$ over this distribution. This is done in Candes and Plan [7] for the Lasso. Here, for fixed 
magnitudes of the k non-zero $\beta_j$'s, the support of $\beta$ is uniformly assigned over all possible subsets of size k. 
Once the support is chosen, the signs for the non-zero $\beta_j$'s are assigned ±1 with equal probability. If Avg[.] 
denotes the expectation with this distribution of $\beta$, it is shown that one could keep Avg$[P_\beta(\mathcal{E}|X)]$ low for 
$\gamma(X)$ as high as $O(1/\log p)$. This condition on $\gamma(X)$ is less stringent than before and leads to a demonstration 
that $n = \Omega(k \log p)$ is sufficient for support recovery, provided $|||X||| \sim \sqrt{p/n}$, where $|||\cdot|||$ denotes the spectral 
norm. We provide comparisons with this work in Section 6. 

Notation: For a set $A \subseteq J$, we denote as $X_A$ the sub-matrix of X comprising the columns with indices in 
A. Similarly, for any p × 1 vector $\beta$, we denote as $\beta_A$ the |A| × 1 sub-vector with indices in A. Also let 
$A^c = J - A$. 



2 Results 

Before discussing our main results with Gaussian matrices, in Subsection 2.1 we state results when the entries 
of X are i.i.d. sub-Gaussian and when the vector $\beta$ has k non-zero entries. The noise vector is also assumed 
to come from a sub-Gaussian distribution with scale $\sigma$. This generalizes the results of Tropp and Gilbert 
[25] to the noisy case. While preparing this manuscript we discovered that Fletcher and Rangan [14] have 
analyzed the OMP for i.i.d. designs and for k-sparse vectors, similar to the setting of Subsection 2.1. However, 
there the analysis was for exact support recovery and was asymptotic in nature. Further, they focused on a 
specific regime, where $k\beta_{min}^2/\sigma^2$ tends to infinity. We provide more comparisons with this work later on in 
the paper. 

We show that $n = \Omega(k \log p)$ samples are sufficient for the recovery of any coefficient vector with $\beta_{min}$ that 
is at least of the same order as the noise level. More specifically, define 

$$\mu_n = \sqrt{(2 \log p)/n}. \tag{7}$$

The quantity $\sigma\mu_n$ can be thought of as the noise level. To see why this is so, consider the orthogonal design 
where $X^T X/n = I$ and noise $\epsilon \sim N(0, \sigma^2 I)$. Assume that, as usual, we are interested in recovering any $\beta$ 
with $|S_0(\beta)| = k$. A natural estimate of the support would be 

$$\hat{S} = \{j : |z_j| > t\} \quad \text{with } z_j = X_j^T Y/n, \tag{8}$$

where t is positive. Notice that $z_j \sim N(\beta_j, \sigma^2/n)$ for each $j \in J$. Correspondingly, since $z_j \sim N(0, \sigma^2/n)$ 
for $j \in J - S_0$, one sees that t has to be of the form $\sigma\mu_n$ in order to prevent false discoveries with high 
probability. Similarly $\beta_j$, for all $j \in S_0$, has to have magnitude at least $\sigma\mu_n$ if one wants to avoid false 
negatives. 
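
The orthogonal-design argument can be reproduced with a small sketch. It assumes the noise level $\mu_n = \sqrt{2\log p/n}$ and the thresholding estimate based on $z_j = X_j^T Y/n$ described above; the helper names `noise_level` and `threshold_support` are hypothetical.

```python
import numpy as np

def noise_level(n, p):
    """mu_n = sqrt(2 log p / n); sigma * mu_n plays the role of the noise level."""
    return np.sqrt(2.0 * np.log(p) / n)

def threshold_support(X, Y, t):
    """Indices j with |z_j| > t, where z_j = X_j^T Y / n."""
    n = X.shape[0]
    z = X.T @ Y / n
    return np.flatnonzero(np.abs(z) > t)

# Orthogonal design with X^T X / n = I: here X = sqrt(n) * I and n = p.
rng = np.random.default_rng(1)
n = p = 64
sigma = 1.0
X = np.sqrt(n) * np.eye(n)
beta = np.zeros(p)
beta[:3] = 5.0 * sigma * noise_level(n, p)   # signals well above the noise level
Y = X @ beta + sigma * rng.standard_normal(n)
print(threshold_support(X, Y, t=2.0 * sigma * noise_level(n, p)))
```

The multiples 5 and 2 of the noise level are arbitrary choices for the illustration; the text only fixes the order of magnitude of t.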



The analysis of i.i.d. designs, as done in Subsection 2.1, forms an important ingredient of compressed sensing 
[9], [10]. However, it may not be useful for statistical applications, where typically the choice of the X matrix 
is not under one's control. Accordingly, in Subsection 2.2 we assume that the rows of X are drawn i.i.d. from 
$N_p(0, \Sigma)$, with certain assumptions on $\Sigma$. This model was also employed to detect the neighborhood of a 
node in high dimensional graphs by Meinshausen and Buhlmann [19]. Moreover, we relax the assumption 
that $\beta$ is k-sparse and only assume that there is a set $S = S(\beta)$, of size k, such that $\beta_{S^c}$ is sparse in a more 
general sense. Here $\beta_{S^c}$ denotes the vector of coefficients outside of S. More specifically, for a constant 
$\nu > 0$, if 

$$S = \{j : |\beta_j| > \sigma\nu\mu_n\}, \quad \text{with } |S| = k, \tag{9}$$

we assume 

$$\|\beta_{S^c}\|_1 \le \sigma\eta\mu_n, \tag{10}$$

for an appropriately chosen $\eta$. A natural choice would be to take $\nu = 1$. Then, S would correspond to the 
indices above the noise level. We show that for $\eta$ not too large, the OMP can detect the large indices in S 
with high probability, provided $\Sigma$ satisfies certain conditions. As a consequence of these results, we show 
that the coefficient estimate satisfies strong oracle inequalities. 

2.1 Recovery with sub-Gaussian designs 

In this section we address the requirements on n, p, k as well as $\beta_{min}$, to recover the support of $\beta$, either 
exactly or nearly so, where we assume that $|S_0(\beta)| = k$. Here $S_0(\beta)$ is as in (2). We allow the case that k 
may be zero. Further, since it may not be a realistic assumption that k is known, we assume that we only 
know an upper bound $\bar{k}$ on k, with $\bar{k} \ge \max\{k, 1\}$. 

Let $X_{\ell j}$, for $\ell = 1, \ldots, n$ and $j = 1, \ldots, p$, denote the entries of the X matrix. Throughout this section 
we assume that the $X_{\ell j}$'s are independent sub-Gaussian with mean 0 and scale 1, that is, $E e^{tX_{\ell j}} \le e^{t^2/2}$ for 
$t \in \mathbb{R}$. Further, we assume that the noise vector $\epsilon$ is independent of X and has independent sub-Gaussian 
entries with mean 0 and scale $\sigma$, that is, $E e^{t\epsilon_i} \le e^{\sigma^2 t^2/2}$ for $t \in \mathbb{R}$, $i = 1, \ldots, n$. Additionally, if $k \ge 1$, we 
assume that the following two conditions are satisfied with high probability. 

Condition 1. There exist $\lambda_{max} \ge \lambda_{min} > 0$, so that the eigenvalues of $X_{S_0}^T X_{S_0}/n$ are between $\lambda_{min}$ and 
$\lambda_{max}$, that is, 

$$\lambda_{max}\|v\|^2 \ge \|X_{S_0}v\|^2/n \ge \lambda_{min}\|v\|^2 \quad \text{for all } v \in \mathbb{R}^k.$$

Condition 2. The $\ell_2$ norm of the noise vector is bounded, that is, $\|\epsilon\|^2/n \le \sigma^2\lambda$, for some $\lambda > 0$. 

Let $\mathcal{E}_{cond}$ be the event that Conditions 1 or 2 fail. The first assumption is related to the restricted isometry 
property (Candes and Tao [S]) and the sparse eigenvalue conditions (Zhang and Huang [27]). Condition 1 
is satisfied for a wide variety of random ensembles. For example, it is satisfied with high probability for the 
Gaussian ensemble, where the $X_{\ell j}$ are i.i.d. N(0, 1), and the binary ensemble, where the $X_{\ell j}$ are i.i.d. uniform 
on {−1, +1} (see for example, Baraniuk et al. [1]). Notice that since we are interested in controlling the 
probability $P_{err}$ in (6), because of the averaging over X, we do require that Condition 1 hold uniformly 
over all $S_0$, with $|S_0| = k$. Condition 2, which bounds the $\ell_2$ norm of the noise vector, is required for 
controlling the norm of the residuals $R_i$. It is satisfied with high probability, for example, when the noise 
$\epsilon \sim N(0, \sigma^2 I)$. 

Below, we state the theorem giving sufficient conditions on n for reliable recovery of the support of $\beta$. The 
threshold $\tau$ is taken to be 

$$\tau = \sqrt{2(1 + a)\log p}, \tag{11}$$

for some a > 0. Here n will be a function of $\bar{k}$ and p, as well as the various quantities defined above. The 
results of course hold with $\bar{k}$ replaced by k, provided k is non-zero. In particular, for $\alpha, \delta > 0$, define 

$$\xi = \xi(\alpha, \delta) = \max\left\{(1 + \delta)r_1,\ \sigma^2 r_2^2 f(\delta)/(\bar{k}\alpha)\right\}, \tag{12}$$

where, 



max{A m(M , A} 

'i = 

and 



ri = A3 T2 

vain 



(13) 



$$f(\delta) = \frac{1}{\left(1 - 1/\sqrt{1 + \delta}\right)^2}. \tag{14}$$

Denote as $\hat{S} = \hat{S}(Y, X, \tau)$ the estimate of the support obtained after running the algorithm with the given 
Y, X and threshold $\tau$. Further, denote the undetected elements of the support as $F = S_0 - \hat{S}$. The theorem 
below provides bounds on the signal strength of the undetected components; here we assume that 
$\sum_{j \in F} \beta_j^2 = 0$ if $F = \emptyset$. 

The following function of k characterizes the probability of failure of the algorithm: 

$$p_{err,k} = P(\mathcal{E}_{cond}) + 2(k + 1)/p^a + 2k/p^{1+a}, \quad \text{for } k \ge 1, \tag{15}$$

and $p_{err,0} = 2/p^a$. Here, recall that $\mathcal{E}_{cond}$ is the event that Conditions 1 or 2 fail. Notice that $p_{err,k} \le p_{err,\bar{k}}$, 
since $k \le \bar{k}$. 

Regarding the choice of a, if k is O(log p), then a can be taken to be slightly larger than 0 for $p_{err,k}$ to be 
small, assuming p is large; however, if k scales, for example, linearly with p, then a needs to be taken to be 
larger than 1. We now state our theorem. 



Theorem 2.1. Let the threshold $\tau$ be as in (11). Further, let n be of the form 

$$n = \xi\bar{k}\tau^2, \tag{16}$$

with $\xi$ as in (12). 
Then, if $k \ge 1$, the following condition holds, except on a set with probability $p_{err,k}$: 

$$\hat{S} \subseteq S_0 \quad \text{and} \quad \sum_{j \in F} \beta_j^2 \le \alpha|F|. \tag{17}$$

In particular, if $\beta_{min}^2 > \alpha$ then $\hat{S} = S_0$, that is, the support is recovered exactly, with probability at least 
$1 - p_{err,k}$. 

If k = 0, $\hat{S} = \emptyset$ with probability at least $1 - p_{err,0}$. 
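
The role of the parameter a in the threshold and in the failure probability can be explored numerically. The sketch below assumes the threshold $\sqrt{2(1+a)\log p}$ and a failure bound of the form $P(\mathcal{E}_{cond}) + 2(k+1)/p^a + 2k/p^{1+a}$ as read from the displays above; the probability of the conditioning event is not computed here and is passed in as the hypothetical argument `p_cond`.

```python
import math

def threshold(a, p):
    """tau = sqrt(2 (1 + a) log p)."""
    return math.sqrt(2.0 * (1.0 + a) * math.log(p))

def failure_bound(k, p, a, p_cond=0.0):
    """Bound on the failure probability; p_cond stands in for P(E_cond)."""
    if k == 0:
        return 2.0 / p ** a
    return p_cond + 2.0 * (k + 1) / p ** a + 2.0 * k / p ** (1.0 + a)

# With k proportional to p, the bound is only small once a exceeds 1.
p = 10_000
for a in (0.5, 1.5):
    print(a, threshold(a, p), failure_bound(k=p // 10, p=p, a=a))
```

A larger a lowers the failure bound but raises the threshold, and hence the sample size, so the two have to be balanced.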

Notice that $\alpha$ controls the accuracy to which the support is estimated. Assuming F is non-empty, another way 
of stating the theorem is that the average signal strength of the undetected components, that is $\|\beta_F\|^2/|F|$, 
is at most $\alpha$. It may seem desirable to make $\alpha$ as small as possible; however, doing so increases the value of 
n in (16), since n is inversely related to $\alpha$ through $\xi(\alpha, \delta)$. Further, if $\alpha$ is taken to be less than $\beta_{min}^2$, then 
the above theorem guarantees exact recovery of the support. Correspondingly, from (16) and (12), one sees 
that if 

$$n = \max\left\{b_1\bar{k},\ \frac{b_2}{\beta_{min}^2}\right\}\log p,$$

for some $b_1, b_2 > 0$, then the support can be recovered exactly with high probability. 



The following corollary, which is a consequence of Theorem 2.4, shows that if $n = \Omega(k \log p)$, one can reliably 
detect the indices with large coefficient values, while ensuring that there are no false discoveries. Further, if 
all the non-zero components are above the noise level (up to a constant factor), one can estimate the support 
exactly with the same number of observations. 

Corollary 2.2. Define $\xi = 32r_2^2(1 + a)$ and $r = 2r_2\sqrt{1 + a}$. Let 

$$n \ge \xi\bar{k}\log p. \tag{18}$$

Then, if $k \ge 1$, with probability at least $1 - p_{err,k}$, the estimate $\hat{S}$ is contained in $S_0$ and further, 

$$\{j : |\beta_j| > r\sigma\sqrt{k}\mu_n\} \subseteq \hat{S}.$$

Further, if $\beta_{min} > r\sigma\mu_n$, then the algorithm can recover the entire support of $\beta$, that is $\hat{S} = S_0$, with probability 
at least $1 - p_{err,k}$. 

If k = 0, then $\hat{S} = \emptyset$ with probability at least $1 - p_{err,0}$. Here $p_{err,\cdot}$ is as in (15). 



2.2 More general results with Gaussian designs 



For Gaussian ensembles, the methods used in the proof of Theorem 2.1 can be extended to give more general 
results on support recovery. In particular, we relax the assumption that X has i.i.d. entries and assume that the 
rows of the X matrix are i.i.d. $N_p(0, \Sigma)$. The noise vector is assumed to be independent of X, with entries 
i.i.d. $N(0, \sigma^2)$. As mentioned earlier, here we also address a more general type of variable selection question, 
where we are not interested in recovering all non-zero entries but only the ones that are large compared to 
the noise level. In particular, for a constant $\nu > 0$, let S be a set of size k as in (9), consisting of the indices 
corresponding to the larger elements (in magnitude) of $\beta$. Once again, we do not assume that k is known, 
but only assume that we have an upper bound $\bar{k}$ on k, with $\bar{k} \ge 1$. Unlike before, we do not require that the 
coefficients outside of S are zero, but only assume that $\|\beta_{S^c}\|_1 \le \sigma\eta\mu_n$, where $\eta$ is allowed to scale at 
most linearly with $\bar{k}$, that is, we assume that $\tilde{\eta} = \eta/\bar{k}$ is O(1). 

Through a permutation of the columns one can, without loss of generality, write $\Sigma$ as 

$$\Sigma = \begin{pmatrix} \Sigma_{SS} & \Sigma_{SS^c} \\ \Sigma_{S^cS} & \Sigma_{S^cS^c} \end{pmatrix},$$

where for $A, A' \subseteq J$, $\Sigma_{AA'} = \mathrm{Cov}(X_{i,A}, X_{i,A'})$ is the covariance matrix between terms in A and A'. We 
denote the elements of the matrix as $\sigma_{jj'}$ or $\Sigma_{jj'}$, and use both notations interchangeably. Without loss of generality, we 
assume that $\sigma_{jj} = 1$ for all j, since if this were not the case, we could always scale the coefficient vector to 
produce such a correlation matrix. 

We make the following assumptions on the correlation matrix $\Sigma$, when $k \ge 1$. These are essentially population 
analogs of the sparse eigenvalue and the irrepresentable conditions respectively. 

1. There exist $s_{min}, s_{max} > 0$ so that 

$$\lambda_{min}(\Sigma_{TT}) \ge s_{min} \quad \text{and} \quad \lambda_{max}(\Sigma_{TT}) \le s_{max}, \tag{19}$$

uniformly for all subsets T, with |T| = k. Here $\lambda_{min}(A)$, $\lambda_{max}(A)$ denote the minimum and maximum 
eigenvalues respectively of a square matrix A. 

2. For some $\omega \in [0, 1)$, the following holds, 

$$\max_{j \in T^c} \|\Sigma_{TT}^{-1}\Sigma_{Tj}\|_1 \le \omega, \tag{20}$$

uniformly for all subsets T of size k. This is essentially the population analog of the irrepresentable 
condition. 
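
For a small correlation matrix, the two population conditions can be checked by brute force over all subsets T of size k. The following sketch assumes the conditions take the form of eigenvalue bounds on $\Sigma_{TT}$ and an $\ell_1$ bound on $\Sigma_{TT}^{-1}\Sigma_{Tj}$ for j outside T; the function name `population_conditions` is hypothetical, and the enumeration is only feasible for small p.

```python
import itertools
import numpy as np

def population_conditions(Sigma, k):
    """Return (s_min, s_max, omega) taken over all subsets T with |T| = k."""
    p = Sigma.shape[0]
    s_min, s_max, omega = np.inf, 0.0, 0.0
    for subset in itertools.combinations(range(p), k):
        T = list(subset)
        S_TT = Sigma[np.ix_(T, T)]
        eig = np.linalg.eigvalsh(S_TT)            # eigenvalues in ascending order
        s_min = min(s_min, float(eig[0]))
        s_max = max(s_max, float(eig[-1]))
        rest = [j for j in range(p) if j not in subset]
        if rest:
            # Columns of coefs are Sigma_TT^{-1} Sigma_Tj for each j outside T.
            coefs = np.linalg.solve(S_TT, Sigma[np.ix_(T, rest)])
            omega = max(omega, float(np.abs(coefs).sum(axis=0).max()))
    return s_min, s_max, omega

# Equicorrelated example: unit diagonal, all off-diagonal entries equal to 0.2.
Sigma = 0.8 * np.eye(3) + 0.2 * np.ones((3, 3))
print(population_conditions(Sigma, k=2))
```

The irrepresentable-type condition requires the returned `omega` to be strictly below one.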



Additionally, for $k \ge 1$, we make the following assumption that imposes bounds on certain interactions 
between $\beta_{S^c}$ and the correlation matrix $\Sigma$. As stated below, they are not very intuitive. Lemma 2.3, however, 
shows that under a simple condition, which controls the magnitude of the off-diagonal elements 
of $\Sigma$, and along with (10), one can show (19) - (21) to hold. 

Let $\Sigma_{S^c|S} = \Sigma_{S^cS^c} - \Sigma_{S^cS}\Sigma_{SS}^{-1}\Sigma_{SS^c}$ denote the variance of the conditional distribution of $X_{i,S^c}$ given $X_{i,S}$, 
where we recall that S is the subset of indices comprising the k largest elements (in magnitude) of $\beta$. Let 
$\mu_n$ be as in (7). We make the following additional assumption. 



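3. For constants $\nu_1, \tilde{\nu}_1 \ge 0$, the following holds, 

$$\|\Sigma_{SS}^{-1}\Sigma_{SS^c}\beta_{S^c}\|_\infty \le \sigma\nu_1\mu_n \quad \text{and} \quad \|\Sigma_{S^c|S}\beta_{S^c}\|_\infty \le \sigma\tilde{\nu}_1\mu_n. \tag{21}$$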



Notice that condition (21) is not required when $\beta$ is exactly sparse, that is, when it has k non-zero entries, 
since in this case $\beta_{S^c}$ is identically equal to zero. In this case, assumptions (19), (20) for exactly sparse vectors 
are identical to the sufficient conditions for support recovery for the Lasso by Wainwright [26]. 

As an example, for the standard Gaussian design, condition (19) is satisfied with $s_{min} = s_{max} = 1$. Condition 
(20) is satisfied with $\omega = 0$. Condition (21) reduces to requiring that $\max_{j \in S^c} |\beta_j| \le \sigma\tilde{\nu}_1\mu_n$, which is satisfied 
with $\tilde{\nu}_1 = \nu$. 



For the case k = 0, instead of (19) - (21), we only make the assumption 

$$\|\Sigma\beta\|_\infty \le \sigma\tilde{\nu}_1\mu_n. \tag{22}$$

Notice that since in this case $S = \emptyset$ and $J = S^c$, alternatively, one may express the left side of the above as 
$\|\Sigma_{S^c|S}\beta_{S^c}\|_\infty$. 

It is well known, see for example Cai and Wang, Tropp [53], that if the correlations between any two 
distinct columns are small, as given by the incoherence condition, it implies both the sparse eigenvalue 
condition (19) as well as the irrepresentable condition (20). We use these results to give simple sufficient 
conditions for (19) - (21), as well as (22) when k = 0, in the following lemma. For this, define the coherence 
parameter, 

$$\gamma = \gamma(\Sigma) = \max_{j \neq j'} |\Sigma_{jj'}|. \tag{23}$$

Further, recall that $\tilde{\eta} = \eta/\bar{k}$. Then we have the following. 

Lemma 2.3. Let S, with |S| = k, be as in (9). Assume that the correlation matrix $\Sigma$ satisfies 

$$\gamma(\Sigma) \le \omega_0/(2\bar{k}), \quad \text{where } 0 \le \omega_0 < 1. \tag{24}$$

Further, assume that the coefficient vector $\beta$ satisfies, for some $\eta \ge 0$, 

$$\|\beta_{S^c}\|_1 \le \sigma\eta\mu_n. \tag{25}$$

Define: 

$$s_{min} = 1 - \omega_0/2, \quad s_{max} = 1 + \omega_0/2, \quad \omega = \omega_0, \tag{26}$$

$$\nu_1 = \omega_0\tilde{\eta}, \quad \tilde{\nu}_1 = \nu + \omega_0\tilde{\eta}. \tag{27}$$

Then, conditions (19) - (21) hold, for $k = 1, \ldots, \bar{k}$, with the above values of $s_{min}$, $s_{max}$, $\omega$, $\nu_1$ and $\tilde{\nu}_1$. 

If k = 0, condition (22) holds with $\tilde{\nu}_1$ in (21). 



The above lemma is proved in Appendix C. Equation (24) controls the maximum correlation between distinct 
columns and can be regarded as the population analog of the incoherence condition (4). Condition (25) 
imposes that $\beta_{S^c}$ has $\ell_1$ norm that is $O(\eta\mu_n)$, where, as mentioned before, $\eta$ is allowed to scale at most 
linearly with $\bar{k}$. 
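
The constants supplied by the lemma are simple functions of the coherence of the correlation matrix. The sketch below assumes the readings $s_{min} = 1 - \omega_0/2$, $s_{max} = 1 + \omega_0/2$, $\omega = \omega_0$, $\nu_1 = \omega_0\tilde{\eta}$ and $\tilde{\nu}_1 = \nu + \omega_0\tilde{\eta}$, with $\tilde{\eta} = \eta/\bar{k}$, and it takes $\omega_0$ to be the smallest value compatible with the coherence bound; the function name and the dictionary keys are hypothetical.

```python
import numpy as np

def lemma_constants(Sigma, k_bar, nu, eta):
    """Constants implied by the coherence of Sigma, under the assumed formulas."""
    Sigma = np.asarray(Sigma, dtype=float)
    off_diag = np.abs(Sigma - np.diag(np.diag(Sigma)))
    gamma = float(off_diag.max())        # coherence: largest off-diagonal entry
    omega0 = 2.0 * k_bar * gamma         # smallest omega0 with gamma <= omega0 / (2 k_bar)
    if omega0 >= 1.0:
        raise ValueError("coherence too large: need gamma < 1 / (2 * k_bar)")
    eta_tilde = eta / k_bar
    return {
        "gamma": gamma,
        "s_min": 1.0 - omega0 / 2.0,
        "s_max": 1.0 + omega0 / 2.0,
        "omega": omega0,
        "nu_1": omega0 * eta_tilde,
        "nu_1_tilde": nu + omega0 * eta_tilde,
    }

Sigma = 0.95 * np.eye(3) + 0.05 * np.ones((3, 3))
print(lemma_constants(Sigma, k_bar=2, nu=1.0, eta=1.0))
```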



Henceforth, for convenience, assume that we have control over the incoherence parameter as in (24) and 
that $\beta$ satisfies (25). Further, the quantities $s_{min}$, $s_{max}$, $\omega$, $\nu_1$ and $\tilde{\nu}_1$ will be as in (26) and (27). 



Condition (25) is more appropriate than an $\ell_1$ constraint on the whole vector $\beta$ since it does not impose 
any constraint on the larger coefficient values. Since the $\beta_j$, for $j \in S^c$, have magnitude at most $\sigma\nu\mu_n$, which 
is of the same order as the noise level, it makes sense for any algorithm to only estimate S accurately. In 
Theorem 2.4 below, we give sufficient conditions on n so that one can reliably estimate S. We note that 
this goal is different from that required in Zhang and Huang [27] for support recovery with approximately 
sparse $\beta$. There, the only constraint on $\beta$ was that $\|\beta_{A_0^c}\|_1 = O(\eta\mu_n)$, for some set $A_0$, with $|A_0| = k$, and 
where $\eta$ is also allowed to grow at most linearly in k. Since there was no constraint on the magnitude of $\beta_j$, for 
$j \in A_0^c$, some of these $\beta_j$'s may have magnitude as high as $O(k\mu_n)$. For this reason, it no longer made sense to 
estimate $A_0$ accurately. Their criterion for an estimate $\hat{S}$ to be good was that $|\hat{S}| = O(k)$ and that the least 
squares fit of Y on the columns in $\hat{S}$ produced a good approximation to $X\beta$. 

The quantities $\lambda_{min}$, $\lambda_{max}$ and $\lambda$ are redefined here. These will now be expressed as functions of $\nu$, $\omega_0$ and $\eta$ 
using the various quantities $s_{min}$, $s_{max}$, $\omega$, $\nu_1$ and $\tilde{\nu}_1$ defined in (26) and (27). 



We will need the quantity $h = \sqrt{\bar{k}/n} + \mu_n$ to be strictly less than one. Below, we arrange $n \ge 2\bar{k}\log p$. 
Correspondingly, one sees that h < 1 if, for example, $\bar{k} \ge 5$ and $p \ge 8$. Let $h_l = (1 - h)^2$ and $h_u = (1 + h)^2$. 
We define the values of $\lambda_{min}$, $\lambda_{max}$ and $\lambda$ in the following manner: 



$$\lambda_{min} = s_{min}h_l \quad \text{and} \quad \lambda_{max} = s_{max}h_u. \tag{28}$$

Further, 

X = (l + s 2 max u 1 2 + u 1 f}) (l + fc- 1 / 2 )' 

(29) 
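
The requirement that h be strictly less than one can be checked with simple arithmetic. The sketch below assumes $h = \sqrt{k/n} + \mu_n$ with $\mu_n = \sqrt{2\log p/n}$, and uses the fact that $n \ge 2k\log p$ gives $\sqrt{k/n} \le 1/\sqrt{2\log p}$ and $\mu_n \le 1/\sqrt{k}$; the function names are hypothetical.

```python
import math

def h_value(k, n, p):
    """h = sqrt(k / n) + sqrt(2 log p / n)."""
    return math.sqrt(k / n) + math.sqrt(2.0 * math.log(p) / n)

def h_upper_bound(k, p):
    """Bound on h that holds whenever n >= 2 k log p."""
    return 1.0 / math.sqrt(2.0 * math.log(p)) + 1.0 / math.sqrt(k)

# The example values from the text: k = 5 and p = 8.
bound = h_upper_bound(5, 8)
print(bound, (1.0 - bound) ** 2, (1.0 + bound) ** 2)
```

The last two printed values are the corresponding worst-case factors $(1 - h)^2$ and $(1 + h)^2$ for this choice of k and p.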



Let $r_1$ be as in (13), now with the above values of $\lambda_{min}$, $\lambda_{max}$ and $\lambda$. The quantity $r_2$ is now given by, 

(30) 



>'2 



(1-w) V\ + 



1 + V\T\ 



Xmir 



Notice that for the i.i.d. Gaussian ensemble and when $\beta$ is k-sparse, the quantities $\omega$, $\nu_1$, $\tilde{\nu}_1$ and $\eta$ can be 
taken as zero. Correspondingly, $r_2$ has the same form as that in (13). 

Further, let $\xi = \xi(\alpha, \delta)$ be as in (12), with $r_1$ and $r_2$ appearing in its definition replaced with the values of 
these quantities defined above. The quantity $p_{err,k}$, for $k \ge 1$, which controls the probability of failure of 
the algorithm, is defined as, 

Perr, k = 4/p + [(fc + l)/p Q + . (31) 



We define $p_{err,0} = 1/p + \sqrt{2/\pi}/(\tau p^a)$. The threshold will now be denoted as $\tau_1$. It will be greater than $\tau$ 
by a factor $\rho \ge 1$. This factor is strictly greater than one if $\beta$ is not $\ell_0$-sparse or if $\gamma(\Sigma)$ is non-zero. We are 
now in a position to state our main theorem. 



Theorem 2.4. Let the assumptions of Lemma 2.3 hold. Set the threshold as $\tau_1 = \rho\tau$, where $\tau$ is as in (11), 
and 

1 — w 

Further, let 

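$$n = \xi\bar{k}\tau_1^2. \tag{33}$$

Then, if $k \ge 1$, the following holds with probability at least $1 - p_{err,k}$: 

$$\hat{S} \subseteq S \quad \text{and} \quad \sum_{j \in F} \beta_j^2 \le \alpha|F|, \tag{34}$$

where $F = S - \hat{S}$. In particular, if $\beta_j^2 > \alpha$ for all $j \in S$, then $\hat{S} = S$ with probability at least $1 - p_{err,k}$. 

If k = 0, one has that $\hat{S} = \emptyset$ with probability at least $1 - p_{err,0}$. 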



Before stating the analog of Corollary 2.2, as an aside, we give implications of the above theorem for exact 
recovery of support for k-sparse vectors and i.i.d. designs for large n, p and k. This will help in understanding 
the results of Theorem 2.4 better. 



In [55] it was shown that for $k$-sparse vectors and i.i.d. Gaussian designs there is a sharp threshold, namely $n \approx 2k\log p$, for exact recovery of the support as $n$, $p$, $k$, as well as $k\beta_{\min}^2/\sigma^2$, tend to infinity. This was also proved for the OMP in [14], under an additional condition on the rate of increase of the signal-to-noise ratio ($\|\beta\|^2/\sigma^2$). We can get similar results using our method by recalling that for i.i.d. Gaussian designs and exactly sparse vectors, $s_{\min} = s_{\max} = 1$ and $\omega$, $\nu_1$, $\tilde\nu_1$ and $\eta$ are all zero. Further, take $\bar k = k$. Correspondingly, since $h$ goes to 0, the quantities $\lambda_{\min}$, $\lambda_{\max}$ and $\lambda$ in (28), (29) tend to 1 as $n$, $p$ and $k$ become large. This implies that $r_1$ tends to one and $r_2$ (30) tends to 2. Further, as $k\beta_{\min}^2/\sigma^2$ tends to infinity, one may also allow $k\alpha/\sigma^2$ to tend to infinity, while keeping $\alpha < \beta_{\min}^2$. From Theorem 2.4 this will ensure that the support is recovered exactly. Next, let us evaluate the quantity $\xi$ (12) appearing in the expression for $n$. As $k\alpha/\sigma^2$ tends to infinity, one sees that the first term in the maximum in (12) is the active one and hence $\xi$ tends to $(1 + \delta)$ (using that $r_1$ tends to 1). One may also appropriately choose $\delta$ to tend to zero, making $\xi$ tend to 1. Accordingly, from (33), one sees that if $n \approx 2(1 + a)k\log p$, for large $k$, $p$, one can recover the support exactly, with probability at least $1 - p_{err,k}$. When $\beta$ is extremely sparse, for example, when $k = O(\log p)$, it is possible to arrange for $a$ to decrease to 0, while making $p_{err,k}$ also go to 0. In this case, one gets the threshold $n \approx 2k\log p$ for exact recovery. However, in the regime where $k$ is not negligible compared to $p$ (for example, when $k/p$ is constant), our results only allow for $a$ to tend to 1 (from above), so as to ensure $p_{err,k}$ goes to zero. In this case our results are slightly inferior, requiring $n \approx 4k\log p$ for exact recovery. We remark in Section 6 on how the results in [14] may be carried over to the general case analyzed here.
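The two regimes above can be compared numerically. The following is a minimal sketch and not part of the original analysis: the helper name `omp_sample_size` is hypothetical, and it only evaluates the limiting expression $2(1 + a)k\log p$, dropping the factors that tend to 1 asymptotically.

```python
import math


def omp_sample_size(k: int, p: int, a: float) -> float:
    """Limiting sample size 2(1 + a) k log p from the discussion of Theorem 2.4.

    Hypothetical helper: factors that tend to 1 in the limit are ignored.
    """
    if k < 0 or p < 2 or a < 0:
        raise ValueError("need k >= 0, p >= 2 and a >= 0")
    return 2.0 * (1.0 + a) * k * math.log(p)


if __name__ == "__main__":
    k, p = 20, 10_000
    # a -> 0: very sparse regime, threshold close to 2 k log p.
    print(omp_sample_size(k, p, a=0.0))
    # a -> 1: k comparable to p, threshold close to 4 k log p.
    print(omp_sample_size(k, p, a=1.0))
```

Taking `a=1.0` doubles the requirement relative to `a=0.0`, which is the gap between the $2k\log p$ and $4k\log p$ thresholds mentioned above.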



We now state the analog of Corollary 2.2. The goal now is not to recover the non-zero entries, but only those that are large compared to the noise level, which is a subset of $S$. We have the following.

Corollary 2.5. Let the assumptions of Lemma 2.3 hold and set the threshold to be $\tau_1$ as in Theorem 2.4. Define $\tilde\xi = 32(r_2\rho)^2(1 + a)$ and $r = 2r_2\rho\sqrt{1 + a}$, where $r_2$ is as in (30). Let

$$n \geq \tilde\xi k\log p. \qquad (35)$$

Then, if $k \geq 1$, with probability at least $1 - p_{err,k}$, the estimate $\hat S$ is contained in $S$ and

$$\{j : |\beta_j| > r\sigma\sqrt{k}\mu_n\} \subseteq \hat S. \qquad (36)$$

Further, if $|\beta_j| > r\sigma\mu_n$ for all $j \in S$, one has $\hat S = S$ with probability at least $1 - p_{err,k}$.
If $k = 0$, then $\hat S$ is empty with probability at least $1 - p_{err,0}$.

Corollary 2.5 gives strong performance guarantees for the OMP under an incoherence property on the correlation matrix and an $\ell_1$ constraint on the smaller coefficients. From (36), one sees that the larger coefficients, that is, those with magnitude $\Omega(\sqrt{k}\mu_n)$, are contained in $\hat S$ with high probability. Better performance can be demonstrated when all $\beta_j$'s, for $j \in S$, have magnitude $\Omega(\mu_n)$. In this case, it is possible to recover $S$, while ensuring that there are no false positives. This is in a sense ideal, since it is nearly what one would expect in the orthogonal design case discussed in the beginning of Section 2. In this case,
assuming S is as in |sj), one sees that in order to prevent false positives, t needs to be S!(/i„). Thus \f3j\, for 
j € S, also needs to be n,(p n ), with a slightly larger constant, to ensure S = S. For example, if the |/3j|'s, 
for j G S, is at least t = (y + 2\/l + a)afi n , then it is not hard to see that the probability S = S is at least 
1 — 2/p a . Of course, the factor of ra obtained here, is larger than the corresponding factor for the orthogonal 
case, since the X matrix is in general quite far from being orthogonal; indeed, it is singular when p > n. 

As a consequence of the above, we state results demonstrating strong oracle inequalities for parameter estimation under the $\ell_2$-loss.






2.2.1 Oracle inequalities under $\ell_2$-loss

Let $\hat\beta$ be the coefficient estimate obtained after running the algorithm. More explicitly, $(\hat\beta_j : j \in \hat S)$ is simply the least squares estimate when $Y$ is regressed on $X_{\hat S}$, and $\hat\beta_j = 0$ for $j \in \hat S^c$.



We assume that the correlation matrix $\Sigma$ satisfies (24), that is,

$$\gamma(\Sigma) \leq \omega_0/(2k), \qquad (37)$$

where $0 \leq \omega_0 < 1$.

For simplicity, we consider the case that $\beta$ satisfies (10) with $\nu = 1$, that is,



(38) 



where $|S| = k$ and $\eta$ is allowed to grow at most linearly with $k$, that is, $\bar\eta = \eta/k$ is $O(1)$. With $\nu = 1$, $S$ denotes the set of indices of coefficients with magnitude greater than the noise level.



For the above values of $\eta$, $\omega_0$ and with $\nu = 1$, evaluate the quantities $s_{\min}$, $s_{\max}$, as well as $\tilde\nu_1$, $\nu_1$ and $\omega$, using expressions (26) and (27). Evaluate $r_2$ as in (30), where the quantities $\lambda$, $\lambda_{\min}$, $\lambda_{\max}$ are calculated using equations (28), (29). Further, let $\tilde\xi$ and $r$ be as in Corollary 2.5. Then we have the following.



Theorem 2.6. Let (31) and (38) hold. For fixed such $\beta$, if

$$n \geq \tilde\xi k\log p,$$

then the following holds with probability at least $1 - p_{err,k}$:

$$\|\hat\beta - \beta\|^2 \leq C\sum_{j=1}^{p} \min(\beta_j^2, \sigma^2\mu_n^2), \qquad (39)$$

where $C = (4/9)r^2$.
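The right side of the oracle inequality in Theorem 2.6 is simple to evaluate for a given coefficient vector. The sketch below is illustrative only: the function name `oracle_bound` is hypothetical, $\mu_n = \sqrt{2\log p/n}$ is taken from Lemma A.2, and the constant $C$ is left as an input because it depends on quantities defined outside this section.

```python
import math


def oracle_bound(beta, sigma, n, C=1.0):
    """Evaluate C * sum_j min(beta_j^2, sigma^2 * mu_n^2).

    mu_n = sqrt(2 log p / n), with p taken to be len(beta).
    The value of C is an assumption supplied by the caller.
    """
    p = len(beta)
    mu_n = math.sqrt(2.0 * math.log(p) / n)
    noise_level_sq = (sigma * mu_n) ** 2
    return C * sum(min(b * b, noise_level_sq) for b in beta)


if __name__ == "__main__":
    # One large coefficient, one zero, one below the noise level.
    print(oracle_bound([3.0, 0.0, 0.01], sigma=1.0, n=200))
```

Each coefficient above the noise level contributes $\sigma^2\mu_n^2$, while each smaller coefficient contributes only its own square, which is the usual oracle trade-off between estimating a coefficient and setting it to zero.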



The above theorem is essentially the analog of similar results for the Lasso [29, Corollary 6.1] and Dantzig selector [HI, Theorem 1.2]. Note, the latter assumes that $\beta$ is $k$-sparse. Our results are more general since we only assume that the $\ell_1$ norm of the smaller coefficients satisfies a certain bound. We proceed to state the corollary of the result assuming $\beta$ is $k$-sparse.



For $k$-sparse $\beta$, we only assume that (37) holds. Take $\eta = k$, so that $\bar\eta = 1$. Evaluate $r_2$ using this value of $\bar\eta$, and with $\nu = 1$, and call it $r_2^*$, that is,



(1 - Uq) Uq + 



■LJ 



(40) 
where once again, the quantities $r_1$ and $\lambda_{\min}$ are calculated using (13), (28) and equations (26) and (27). Further, let $\xi^*$ have the same expression as $\tilde\xi$, except it is evaluated using $r_2^*$ instead of $r_2$. Similarly, let $r^* = 2r_2^*\rho\sqrt{1 + a}$. Then we have the following.



Corollary 2.7. Let (31) hold and let $\beta$ be a fixed $k$-sparse vector, for some $k > 0$. If

$$n \geq \xi^* k\log p,$$

then for $C_1 = (4/9)(r^*)^2$, the following holds except on a set with probability $p_{err,k}$:

$$\|\hat\beta - \beta\|^2 \leq C_1\sum_{j=1}^{p} \min(\beta_j^2, \sigma^2\mu_n^2). \qquad (41)$$

We now proceed to give proofs of our main results. The proofs employ techniques developed in Zhang [28] and Tropp and Gilbert [25].



3 Proof of results in Subsection 2.1 



Proof of Theorem 2.1. The following statistics will be useful in our analysis. Denote

$$Z_i = \max_{j \in S_0} |Z_{ij}| \quad\text{and}\quad \tilde Z_i = \max_{j \in S_0^c} |Z_{ij}|. \qquad (42)$$

Notice that if $Z_i > \tau$ and $Z_i > \tilde Z_i$, then the index detected in step $i$, that is $a(i)$, belongs to $S_0$.
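To make the role of these statistics concrete, here is a minimal pure-Python sketch of threshold-stopped OMP support selection. It rests on assumptions: the statistic $Z_{ij} = X_j^T R_{i-1}/\|R_{i-1}\|$ and the rule "stop once no statistic exceeds $\tau$" are read off from the proofs in this section, since the formal algorithm is stated elsewhere in the paper. The function name `omp_support` and the Gram-Schmidt residual update are illustrative choices, not the author's implementation.

```python
import math


def omp_support(X, y, tau):
    """Return the indices selected by threshold-stopped OMP, in order.

    X is a list of n rows of length p, y a list of length n. At each step
    the statistic |X_j^T R| / ||R|| is computed for every column, and the
    maximizer is added if it exceeds tau (assumed stopping rule).
    """
    n, p = len(X), len(X[0])
    columns = [[X[i][j] for i in range(n)] for j in range(p)]
    residual = [float(v) for v in y]
    basis = []      # orthonormal basis of the span of the selected columns
    selected = []
    while len(selected) < min(n, p):
        norm_r = math.sqrt(sum(r * r for r in residual))
        if norm_r < 1e-12:
            break
        stats = [abs(sum(c * r for c, r in zip(col, residual))) / norm_r
                 for col in columns]
        candidates = [j for j in range(p) if j not in selected]
        j_best = max(candidates, key=lambda j: stats[j])
        if stats[j_best] <= tau:
            break
        # Orthogonalize the new column against the current basis.
        q = list(columns[j_best])
        for b in basis:
            coef = sum(bi * qi for bi, qi in zip(b, q))
            q = [qi - coef * bi for bi, qi in zip(b, q)]
        norm_q = math.sqrt(sum(qi * qi for qi in q))
        if norm_q < 1e-12:
            break
        q = [qi / norm_q for qi in q]
        basis.append(q)
        selected.append(j_best)
        # New residual: projection of y off the span of the selected columns.
        coef = sum(qi * ri for qi, ri in zip(q, residual))
        residual = [ri - coef * qi for qi, ri in zip(q, residual)]
    return selected


if __name__ == "__main__":
    X = [[2, 0, 0], [0, 2, 0], [0, 0, 2], [0, 0, 0]]
    y = [6, 0, 2, 0]  # equals X applied to beta = (3, 0, 1), no noise
    print(omp_support(X, y, tau=1.0))
```

Subtracting the projection onto each newly orthogonalized column leaves the residual of the least squares fit on the selected columns, which is what distinguishes OMP from plain matching pursuit.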



We first prove the case $k \geq 1$. Let $\mathcal{E}$ be the event that statement (17) in Theorem 2.1 does not hold. We want to show that the probability of $\mathcal{E}$ is small. There are two types of errors that we wish to control. Let $\mathcal{E}_1$ be the event that $\hat S$ is not contained in $S_0$. Further, let $\mathcal{E}_2$ be the event that $\hat S$ is contained in $S_0$, however $\sum_{j \in F} \beta_j^2 > \alpha|F|$. Clearly $\mathcal{E} = \mathcal{E}_1 \cup \mathcal{E}_2$.

We use an argument similar to that used in Tropp and Gilbert [25]. We initially pretend that $X = X_{S_0}$ and that the coefficient vector $\beta$ is shortened to a $k \times 1$ vector $\beta_{S_0}$ with all non-zero entries. Notice that $Y = X_{S_0}\beta_{S_0} + \epsilon$. For a given threshold $\tau$, we run the algorithm on this truncated problem. Let $m \leq k$ be the number of steps and let $R_1, R_2, \ldots, R_m$ be the associated residuals after each step. Also, denote as $R_0$ the vector $Y$. Notice that $m, R_0, R_1, \ldots, R_m$ are functions of $A = [X_{S_0} : \epsilon]$.



Let $\mathcal{E}_u$ be the event that statement (17) does not hold for the truncated problem. More explicitly, taking $\hat S_1 = \hat S(Y, X_{S_0}, \tau)$ and $F_1 = S_0 - \hat S_1$, it is the event that $\|\beta_{F_1}\|^2 > \alpha|F_1|$.

Denote

$$T_i = \max_{j \in S_0} \frac{|X_j^T R_{i-1}|}{\|R_{i-1}\|} \quad\text{and}\quad \tilde T_i = \max_{j \in S_0^c} \frac{|X_j^T R_{i-1}|}{\|R_{i-1}\|},$$

for $i = 1, \ldots, m + 1$. Notice that the statistics $T_i$, $\tilde T_i$ are similar to $Z_i$, $\tilde Z_i$, the only difference being that the residuals involved in the former arise from running the algorithm on the truncated problem, whereas in the latter they arise from consideration of the original problem. Further, let $\mathcal{E}_f$ be the event

$$\mathcal{E}_f = \{\tilde T_i > \tau,\ \tilde T_i > T_i \text{ for some } i \leq m + 1\}.$$



We now show that $\mathcal{E} \subseteq \mathcal{E}_u \cup \mathcal{E}_f$. To see this, write $\mathcal{E}$ as a disjoint union $\mathcal{E}_1 \cup \mathcal{E}_2$, where $\mathcal{E}_2 = \mathcal{E}_1^c \cap \mathcal{E}$. Let's first consider the case that $\mathcal{E}_2$ occurs. Clearly this means that $\mathcal{E}_u$ has occurred if the algorithm were run on the truncated problem for the given $A$.

Next, consider the case that $\mathcal{E}_1$ occurs. Let $\bar R_0, \bar R_1, \ldots$ etc. be the residuals for the original problem (1), for the given realization of $[X : \epsilon]$. Let $i^*$ be the step for which the false alarm occurs for the first time. Clearly, $i^* \leq m + 1$, since otherwise it would mean that the truncated problem (with $X = X_{S_0}$) ran for more than $m$ steps. Also, we must have $\{Z_i > \tau,\ Z_i > \tilde Z_i\}$ occur for $1 \leq i \leq i^* - 1$ and $\{\tilde Z_{i^*} > \tau,\ \tilde Z_{i^*} > Z_{i^*}\}$ occur. Correspondingly, one sees that $\bar R_0 = R_0, \ldots, \bar R_{i^*-1} = R_{i^*-1}$, which implies that $T_{i^*} = Z_{i^*}$ and $\tilde T_{i^*} = \tilde Z_{i^*}$. Consequently, as $\{\tilde T_{i^*} > \tau,\ \tilde T_{i^*} > T_{i^*}\}$ occurs, $\mathcal{E}_f$ occurs. Hence, $\mathcal{E} \subseteq \mathcal{E}_u \cup \mathcal{E}_f$, which gives

$$P(\mathcal{E}) \leq P(\mathcal{E}_u) + P(\mathcal{E}_f).$$

Consequently, all we are left with is to bound the probabilities of $\mathcal{E}_f$ and $\mathcal{E}_u$.

We first bound the probability of $\mathcal{E}_f$. For this, notice that $\mathcal{E}_f \subseteq \mathcal{E}_f'$, where $\mathcal{E}_f' = \{\max_{1 \leq i \leq m+1} \tilde T_i > \tau\}$. Since $X_{S_0^c}$ is independent of $A = [X_{S_0} : \epsilon]$, one has that $X_{S_0^c}$ is independent of $R_0, \ldots, R_m$. Correspondingly, from Lemma A.1(a), conditional on $A$, we have that $X_j^T R_{i-1}/\|R_{i-1}\|$ is sub-gaussian with mean 0 and scale 1, for $j \in S_0^c$ and $1 \leq i \leq m + 1$. Consequently, using standard results on the maximum of sub-Gaussian random variables (Lemma A.1(b)), if $\tau$ is as in (11), one gets that $P(\mathcal{E}_f' \mid A) \leq 2(m + 1)/p^a$, using $|S_0^c| \leq p$. Since $m \leq k$, this probability is bounded by $2(k + 1)/p^a$, which implies $P(\mathcal{E}_f) \leq 2(k + 1)/p^a$.

Next, we bound the probability of $\mathcal{E}_u$. For this, consider a linear model of the form

$$U = H\varphi + w, \qquad (43)$$

where $H$ is an $n \times k$ matrix, $w$ an $n \times 1$ vector and $\varphi$ a $k \times 1$ dimensional coefficient vector. After running the OMP on this model (with $Y = U$, $X = H$ and threshold $\tau_0$), let $\hat S_2 = \hat S(U, H, \tau_0)$ be the estimate of the support. Further, let $\hat\varphi$ be the coefficient estimate obtained, that is, $(\hat\varphi_j : j \in \hat S_2)$ is the least squares estimate when $U$ is regressed on $H_{\hat S_2}$ and $\hat\varphi_j = 0$ for $j$ not in $\hat S_2$. We use the following lemma, the proof of which is similar to the analysis in Zhang [28].



Lemma 3.1. For the model (43), let the following hold.

(i) Condition 1 holds for $H$, that is, the eigenvalues of $H^T H/n$ are between $\lambda_{\min}$ and $\lambda_{\max}$.
(ii) Condition 2 holds for $w$, that is, $\|w\|^2 \leq n\sigma^2\lambda$, for some $\lambda > 0$.
(iii) $\|\hat\varphi_{ls} - \varphi\|_\infty \leq \sigma c_0\tau_0/\sqrt{n}$, for some constant $c_0 > 0$, where $\hat\varphi_{ls}$ is the coefficient vector of the least squares fit of $U$ on $H$.

Under the above, if the OMP is run with $Y = U$, $X = H$ and threshold $\tau_0$, when the algorithm stops we must have the following.

(a)
$$\left(1 - \tau_0\sqrt{r_1 k/n}\right)\|\varphi_{F_2}\| \leq \tilde r_2\,\sigma\tau_0\sqrt{|F_2|/n}, \qquad (44)$$

where $F_2 = \{1, \ldots, k\} - \hat S_2$ denotes the indices not detected after running the algorithm. Further, $r_1$ has the same form as (13), replaced with the above values of $\lambda_{\min}$, $\lambda_{\max}$ and $\lambda$. Also, $\tilde r_2 = c_0 + \sqrt{r_1}$.

(b)
$$\|\hat\varphi - \varphi\| \leq \frac{\tilde r_2\,\sigma\tau_0\sqrt{k/n}}{1 - \tau_0\sqrt{r_1 k/n}}. \qquad (45)$$

The above lemma is proved in Appendix B. We only require the conclusions in part (a) of the lemma for the time being. Part (b) will be required in Subsection 2.2.1 to get bounds on the $\ell_2$-error of the coefficient estimate.

Now apply Lemma 3.1 to the truncated problem, that is, with $H = X_{S_0}$, $\varphi = \beta_{S_0}$, $U = Y$ and $\tau_0 = \tau$. Notice that in this case $F_2 = F_1$ and $\hat S_2 = \hat S_1$. We know that requirements (i) and (ii) of the lemma hold, except on a set $\mathcal{E}_{cond}$. The following lemma shows that (iii) holds with high probability.

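Lemma 3.2. Let $\hat\beta_{ls}$ be the least squares fit when $Y$ is regressed on $X_{S_0}$. Further, let

$$\mathcal{E}_{ls} = \{\|\hat\beta_{ls} - \beta_{S_0}\|_\infty > \sigma c_0\tau/\sqrt{n}\},$$

where $c_0 = 1/\sqrt{\lambda_{\min}}$. Then $P(\mathcal{E}_{ls} \cap \mathcal{E}_{cond}^c) \leq 2k/p^{1+a}$.

The above lemma is proved after this proof. Using the above lemma, all requirements of Lemma 3.1 hold, except on a set $\tilde{\mathcal{E}}_u = \mathcal{E}_{cond} \cup \mathcal{E}_{ls}$, the probability of which is bounded by $P(\mathcal{E}_{cond}) + 2k/p^{1+a}$. We now show that $\mathcal{E}_u \subseteq \tilde{\mathcal{E}}_u$. We do this by showing $\tilde{\mathcal{E}}_u^c \subseteq \mathcal{E}_u^c$. To see this, notice that on $\tilde{\mathcal{E}}_u^c$, one has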



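$$\left(1 - \tau\sqrt{r_1 k/n}\right)\|\beta_{F_1}\| \leq \tilde r_2\,\sigma\tau\sqrt{|F_1|/n} \qquad (46)$$

from (44). Assume that $F_1$ is non-empty, since otherwise the claim is trivially true. Notice that since $n \geq (1 + \delta)r_1\bar k\tau^2$ from (16), one has $\tau\sqrt{\bar k r_1/n} \leq 1/\sqrt{1 + \delta}$. Now, since $k \leq \bar k$, the left side of (46) is non-negative. Thus, (46) can be re-expressed as

$$\|\beta_{F_1}\|^2 \leq \sigma^2 r_2^2 f(\delta)\,\tau^2\,|F_1|/n,$$

which follows from noticing that $\tilde r_2 = r_2$, where $r_2$ is as in (13). Now, since $n \geq \sigma^2 r_2^2 f(\delta)\tau^2/\alpha$, the left side of the above is at most $\alpha|F_1|$. Thus, $\sum_{j \in F_1} \beta_j^2 \leq \alpha|F_1|$ on $\tilde{\mathcal{E}}_u^c$, which implies that $\mathcal{E}_u \subseteq \tilde{\mathcal{E}}_u$.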

Consequently, $P(\mathcal{E}_u) \leq P(\mathcal{E}_{cond}) + 2k/p^{1+a}$. Accordingly, since $P(\mathcal{E}) \leq P(\mathcal{E}_u) + P(\mathcal{E}_f)$, one has $P(\mathcal{E}) \leq P(\mathcal{E}_{cond}) + 2k/p^{1+a} + 2(k + 1)/p^a$, which is equal to $p_{err,k}$. This completes the proof for the case $k \geq 1$.

For the case $k = 0$, we just need to show that the algorithm stops after the first step, in which case $\hat S = \emptyset$. This is immediately seen by noticing that for $k = 0$, one has that $Z_{1j}$, for $j \in J$, are sub-gaussian with mean 0 and scale 1. Correspondingly, from Lemma A.1(b), the event $\{\max_{j \in J} |Z_{1j}| > \tau\}$ has probability at most $p_{err,0} = 2/p^a$. □

Proof of Lemma 3.2. Firstly, note that $\hat\beta_{ls} - \beta_{S_0}$ can be expressed as $Z = (X_{S_0}^T X_{S_0})^{-1} X_{S_0}^T \epsilon$. Let $Z = (Z_j : j = 1, \ldots, k)$. Now, conditioned on $X_{S_0}$, each $Z_j$ is sub-gaussian with mean 0 and scale $\sigma_j = \sigma\sqrt{e_j^T (X_{S_0}^T X_{S_0})^{-1} e_j}$. Here, $e_j$ is the $j$th column of the size $k$ identity matrix. Correspondingly, from Lemma A.1(b), one gets that $\max_j |Z_j|$ is less than $(\max_j \sigma_j)\tau$, except on a set with probability $2k/p^{1+a}$. Finally, observe that on $\mathcal{E}_{cond}^c$, one has $e_j^T (X_{S_0}^T X_{S_0})^{-1} e_j \leq 1/(n\lambda_{\min})$, since the maximum eigenvalue of $(X_{S_0}^T X_{S_0}/n)^{-1}$ is at most $1/\lambda_{\min}$. Thus, $\max_j \sigma_j\tau$ is at most $\sigma c_0\tau/\sqrt{n}$, with $c_0 = 1/\sqrt{\lambda_{\min}}$. □



Proof of Corollary 2.2. Take $\alpha(\delta) = \sigma^2/[(1 + \delta)k]$. Further, let $\xi(\delta) = \xi(\alpha(\delta), \delta)$, which, using $r_2^2 \geq r_1$ and $f(\delta) \geq 1$, can be written as

$$\xi(\delta) = (1 + \delta)f(\delta)r_2^2. \qquad (47)$$

The function $(1 + \delta)f(\delta)$, for $\delta > 0$, has its minimum at $\delta^* = 3$. Further, it is increasing and goes to infinity as $\delta$ tends to infinity. Now, using $\xi(\delta^*) = 16r_2^2$, notice that $\xi(\delta^*)k\tau^2 = \tilde\xi k\log p$. Correspondingly, since $n \geq \tilde\xi k\log p$, one gets that

$$n = \xi(\delta)k\tau^2, \qquad (48)$$

for some $\delta \geq \delta^*$. Consequently, from Theorem 2.1 one has

$$\hat S \subseteq S_0 \quad\text{and}\quad \sum_{j \in F} \beta_j^2 \leq \alpha(\delta)|F|, \qquad (49)$$

with probability at least $1 - p_{err,k}$. Use $f(\delta) \leq f(\delta^*) = 4$ to get from (48) that $n \leq 4(1 + \delta)r_2^2 k\tau^2$. Correspondingly, $\alpha(\delta)$ is at most $r^2\sigma^2\mu_n^2$. Consequently, any $j$ with $|\beta_j| > r\sigma\sqrt{k}\mu_n$ cannot be in $F$, since it would contradict the inequality in (49). Further, if $\beta_{\min} > r\sigma\mu_n$, the inequality in (49) cannot hold if $F$ is non-empty. In this case the algorithm recovers the entire support. □
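The claim that $(1 + \delta)f(\delta)$ is minimized at $\delta^* = 3$ with $f(\delta^*) = 4$ can be checked numerically. The sketch below assumes $f(\delta) = (1 - 1/\sqrt{1 + \delta})^{-2}$. This form is an assumption, since $f$ is defined outside this section, but it is consistent with the values $f(3) = 4$ and $16r_2^2$ used in the proof. The helper names are hypothetical.

```python
import math


def f(delta: float) -> float:
    """Assumed form of f: (1 - 1/sqrt(1 + delta))^(-2)."""
    return 1.0 / (1.0 - 1.0 / math.sqrt(1.0 + delta)) ** 2


def g(delta: float) -> float:
    """The function (1 + delta) * f(delta) minimized in the proof."""
    return (1.0 + delta) * f(delta)


def argmin_on_grid(func, lo: float, hi: float, steps: int) -> float:
    """Grid point in [lo, hi] at which func is smallest."""
    best = min(range(steps + 1), key=lambda i: func(lo + (hi - lo) * i / steps))
    return lo + (hi - lo) * best / steps


if __name__ == "__main__":
    print(f(3.0), g(3.0))
    print(argmin_on_grid(g, 0.5, 10.0, 9500))
```

With $u = \sqrt{1 + \delta}$ the function becomes $u^4/(u - 1)^2$, whose derivative vanishes at $u = 2$, that is, at $\delta = 3$, where it takes the value 16.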
4 Proof of results in Subsection 2.2



Proof of Theorem 2.4. Once again, we first prove the case $k \geq 1$. As before, we are interested in bounding the probability of $\mathcal{E}$, where $\mathcal{E} = \mathcal{E}_1 \cup \mathcal{E}_2$. Here $\mathcal{E}_1$ is the event that $\hat S$ is not contained in $S = S(\beta)$. Also, $\mathcal{E}_2$ is the event $\hat S \subseteq S$ and $\|\beta_F\|^2 > \alpha|F|$, where, here, $F = S - \hat S$ and $\hat S = \hat S(Y, X, \tau_1)$. Write $Y$ as $Y = X_S\beta_S + \tilde\epsilon$, where $\tilde\epsilon = X_{S^c}\beta_{S^c} + \epsilon$. Analogous to before, we initially pretend that $X = X_S$ and $\beta = \beta_S$ and run the algorithm on the truncated problem to get residuals $R_0, R_1, R_2, \ldots, R_m$. These residuals are functions of $[X_S : \tilde\epsilon]$. Further, as before, let $\mathcal{E}_u$ be the event that statement (34) is not met for this truncated problem. With $\hat S_1 = \hat S(Y, X_S, \tau_1)$ and $F_1 = S - \hat S_1$, it is the event that $\|\beta_{F_1}\|^2 > \alpha|F_1|$. Similarly, we define $T_i$, $\tilde T_i$ as before, now with the maximum taken over $S$ instead of $S_0$. Further, define the event $\mathcal{E}_f$ analogous to before, with $\tau$ replaced by $\tau_1$. Using the same reasoning as in Theorem 2.1 one has $\mathcal{E} \subseteq \mathcal{E}_u \cup \mathcal{E}_f$. We first proceed to bound the probability of $\mathcal{E}_f$. Notice that unlike previously, the $X_j$'s, for $j \in S^c$, are not independent of the $R_i$'s. This makes bounding the probability of $\mathcal{E}_f$ more involved.

The following lemma will be useful, both in bounding $P(\mathcal{E}_f)$ as well as $P(\mathcal{E}_u)$. We denote as $\hat\beta_{ls}$ the least squares estimate when $Y$ is regressed on $X_S$.



Lemma 4.1. Parts (i)-(iii) of this lemma demonstrate that requirements (i)-(iii) of Lemma 3.1 are satisfied with high probability.

(i) With $\lambda_{\min}$, $\lambda_{\max}$ as in (28), the following holds with probability at least $1 - 2/p$:

$$\lambda_{\min}\|v\|^2 \leq \|X_S v\|^2/n \leq \lambda_{\max}\|v\|^2 \quad\text{for all } v \in \mathbb{R}^k. \qquad (50)$$

(ii) Let $\lambda$ be as in (29). Then $\|\tilde\epsilon\|^2/(n\sigma^2) \leq \lambda$, with probability at least $1 - 1/p$.

(iii) Let $\mathcal{E}_{ls} = \{\|\hat\beta_{ls} - \beta_S\|_\infty > \sigma c_0\tau_1/\sqrt{n}\}$, where

$$c_0 = (1 - \omega)\left[\tilde\nu_1 + \sqrt{\frac{1 + \nu_1\bar\eta}{\lambda_{\min}}}\right]. \qquad (51)$$

Then $P(\mathcal{E}_{cond}^c \cap \mathcal{E}_{ls}) \leq \sqrt{2/\pi}\,k/(\tau p^{1+a})$, where $\mathcal{E}_{cond}$, here, is the event that (i) or (ii) above fails. From (i) and (ii) it has probability at most $3/p$.

The above lemma is proved in Section 5. As mentioned before, the $X_j$'s, for $j \in S^c$, are not independent of the $R_i$'s. We get around this by finding the conditional distribution of each $X_j$ given $X_S$ and $\tilde\epsilon$. Correspondingly, each $X_j$ may be represented as a linear combination of columns in $A = [X_S : \tilde\epsilon]$ plus a noise vector, which we call $Z_j$. This noise term is independent of $A$ and hence of $R_0, R_1, \ldots, R_m$.



Let

$$a_j = \Sigma_{SS}^{-1}\Sigma_{Sj} \quad\text{and}\quad b_j = \frac{e_j^T \Sigma_{S^c|S}\,\beta_{S^c}}{\sqrt{d}}, \qquad (52)$$

where $e_j$ is the $j$th column of the size $p - k$ identity matrix and

$$d = \sigma^2 + \beta_{S^c}^T \Sigma_{S^c|S}\,\beta_{S^c}. \qquad (53)$$

The following lemma characterizes the conditional distribution of $X_j$ given $A$.

Lemma 4.2. Let $a_j$, $b_j$, for $j \in S^c$, be as above. Then we have the following:

(i) The distribution of $X_j$, for $j \in S^c$, may be represented as

$$X_j = X_S a_j + b_j W + Z_j, \qquad (54)$$

where $W \sim N(0, I_n)$ and is independent of $X_S$. Further, $Z_j$ is independent of $[X_S : \tilde\epsilon]$ and follows $N(0, \sigma_{jj} I_n)$, with $\sigma_{jj} \leq \Sigma_{jj} = 1$.



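(ii) Define, for $j \in S^c$ and $i = 1, \ldots, m + 1$,

$$V_{ji} = b_j\frac{W^T R_{i-1}}{\|R_{i-1}\|} + E_{ji}, \qquad (55)$$

where $E_{ji} = Z_j^T R_{i-1}/\|R_{i-1}\|$. Let

$$\tilde{\mathcal{E}}_f = \left\{\max_{1 \leq i \leq m+1,\ j \in S^c} |V_{ji}| > (1 - \omega)\tau_1\right\}. \qquad (56)$$

Then $P(\tilde{\mathcal{E}}_f) \leq 1/p + \sqrt{2/\pi}\,(k + 1)/(\tau p^a)$.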

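The above lemma is proved in Section 5. We now show that $\mathcal{E}_f \subseteq \tilde{\mathcal{E}}_f$. To see this, notice that on $\tilde{\mathcal{E}}_f^c$ one has

$$\tilde T_i \leq (\max_j \|a_j\|_1)\,T_i + (1 - \omega)\tau_1 \leq \omega T_i + (1 - \omega)\tau_1, \qquad (57)$$

for $i = 1, \ldots, m + 1$. Here, the first inequality follows from using (54) and $|a_j^T X_S^T R_{i-1}|/\|R_{i-1}\| \leq \|a_j\|_1 T_i$, along with the fact that $|V_{ji}|$ is bounded by $(1 - \omega)\tau_1$ on $\tilde{\mathcal{E}}_f^c$. The second inequality follows from (20). We now show that

$$\mathcal{E}' = \{\tilde T_i \leq \omega T_i + (1 - \omega)\tau_1 \text{ for each } i \leq m + 1\}$$

implies $\mathcal{E}_f^c$. To see this, for each $i$, consider two cases, viz. $T_i > \tau_1$ and $T_i \leq \tau_1$. From (57), in the first case one has $\tilde T_i < T_i$, and in the second case, one has $\tilde T_i \leq \tau_1$. Correspondingly, $\mathcal{E}'$ is contained in

$$\{\tilde T_i \leq T_i \text{ or } \tilde T_i \leq \tau_1 \text{ for each } i \leq m + 1\},$$

which is $\mathcal{E}_f^c$. Consequently, $\mathcal{E}_f \subseteq \tilde{\mathcal{E}}_f$, and hence $P(\mathcal{E}_f) \leq 1/p + \sqrt{2/\pi}\,(k + 1)/(\tau p^a)$ from Lemma 4.2.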
What remains to be seen is that the probability of the event $\mathcal{E}_u$ can be bounded as before. For this we apply Lemma 3.1 once again. That conditions (i)-(iii), required for application of Lemma 3.1, are satisfied with high probability is proved in parts (i)-(iii) of Lemma 4.1. Consequently, as before, if $\tilde{\mathcal{E}}_u = \mathcal{E}_{cond} \cup \mathcal{E}_{ls}$, where the sets on the right side are as in Lemma 4.1, one gets that on $\tilde{\mathcal{E}}_u^c$,

$$\left(1 - \tau_1\sqrt{r_1 k/n}\right)\|\beta_{F_1}\| \leq \tilde r_2\,\sigma\tau_1\sqrt{|F_1|/n}. \qquad (58)$$

Here $\tilde r_2 = c_0 + \sqrt{r_1}$, where $c_0$ is as in (51). Notice that $\tilde r_2 = r_2$, where $r_2$ is as in (30). Now, once again use the fact that $n \geq (1 + \delta)r_1 k\tau_1^2$ and $n \geq r_2^2 f(\delta)\sigma^2\tau_1^2/\alpha$, to get that (58) implies $\mathcal{E}_u^c$. Accordingly, $P(\mathcal{E}_u) \leq P(\tilde{\mathcal{E}}_u)$. Consequently, one has

$$P(\mathcal{E}) \leq P(\mathcal{E}_u \cup \mathcal{E}_f) \leq P(\mathcal{E}_{cond}) + P(\mathcal{E}_{cond}^c \cap \mathcal{E}_{ls}) + P(\mathcal{E}_f),$$

which is at most $p_{err,k} = 4/p + (\sqrt{2/\pi}/\tau)\left[(k + 1)/p^a + k/p^{1+a}\right]$. This completes the proof for $k \geq 1$.

If $k = 0$, we will show that the probability that $\max_{j \in J} |Z_{1j}|$ exceeds $\tau_1$ is at most $p_{err,0}$. This would imply that the algorithm stops after one step and $\hat S$ is empty. Notice that $S^c = J$ and hence $\tilde\epsilon = Y$. Consequently, $X_j = b_j Y/\sigma_Y + Z_j$, where $Z_j \sim N(0, \sigma_{jj} I_n)$ is independent of $Y$, with $\sigma_{jj} \leq 1$. Also, $b_j = e_j^T\Sigma\beta/\sigma_Y$, where $\sigma_Y^2 = \mathrm{Var}(Y_1) = \sigma^2 + \beta^T\Sigma\beta$. Correspondingly,

$$Z_{1j} = \frac{e_j^T\Sigma\beta}{\sigma_Y}\cdot\frac{\|Y\|}{\sigma_Y} + \frac{Z_j^T Y}{\|Y\|}. \qquad (59)$$

Using $\sigma_Y \geq \sigma$, one has $|b_j| \leq \nu_1\mu_n$. Further, using $\|Y\|/\sigma_Y \leq \sqrt{n}(1 + \mu_n)$, with probability at least $1 - 1/p$ from Lemma A.2, one has that the first term in the right side of (59) is at most $\nu_1\tau(1 + \bar k^{-1/2})$ with probability at least $1 - 1/p$. Further, $|Z_j^T Y/\|Y\||$, using the independence of $Z_j$ and $Y$, is less than $\tau$ for all $j$ with probability at least $1 - \sqrt{2/\pi}/(\tau p^a)$ (Lemma A.1(b)). Denoting $\tau_2 = [\nu_1(1 + \bar k^{-1/2}) + 1]\tau$, one sees that $\max_{j \in J} |Z_{1j}| \leq \tau_2$, with probability at least $1 - p_{err,0}$. Notice that since $\tau_1 \geq \tau_2$, the event $\max_{j \in J} |Z_{1j}| \leq \tau_1$ also has probability at least $1 - p_{err,0}$. This completes the proof. □



Proof of Corollary 2.5. The proof is exactly similar to that of Corollary 2.2. As before, taking $\alpha(\delta) = \sigma^2/[(1 + \delta)k]$ and $\xi(\delta) = \xi(\alpha(\delta), \delta)$, we notice that $\rho^2\xi(\delta^*)k\tau^2 = \tilde\xi k\log p$, where $\delta^* = 3$. Correspondingly, if $n \geq \tilde\xi k\log p$, one has $n = \rho^2\xi(\delta)k\tau^2$ for some $\delta \geq \delta^*$ and hence,

$$\hat S \subseteq S \quad\text{and}\quad \sum_{j \in F} \beta_j^2 \leq \alpha(\delta)|F|$$

with probability at least $1 - p_{err,k}$, from Theorem 2.4. Further, $\alpha(\delta)$ is at most $r^2\sigma^2\mu_n^2$, using the same reasoning as before. The conclusions on recovering the large coefficients follow immediately from this. □
Proof of Theorem 2.6. Notice that

$$\|\hat\beta - \beta\|^2 = \|\hat\beta_S - \beta_S\|^2 + \|\hat\beta_{S^c} - \beta_{S^c}\|^2. \qquad (60)$$

We apply the result of Corollary 2.5 to get that except on a set with probability $p_{err,k}$, one has $\hat S \subseteq S$. Correspondingly, the second term in (60) is simply $\|\beta_{S^c}\|^2$, which is equal to $\sum_{j \in S^c} \min\{\beta_j^2, \sigma^2\mu_n^2\}$.

Let's next concentrate on the first term in (60). Notice that since $\hat S \subseteq S$, $\hat\beta_S$ is the same as the coefficient estimate one would get if the OMP were run on the truncated problem. Correspondingly, using part (b) of Lemma 3.1, with $\tau_0 = \tau_1$ and $\tilde r_2 = r_2$, one gets that

$$\|\hat\beta_S - \beta_S\| \leq \frac{r_2\sigma\tau_1\sqrt{k/n}}{1 - \tau_1\sqrt{r_1 k/n}}, \qquad (61)$$

with probability at least $1 - p_{err,k}$. Next, use the fact that $\tau_1\sqrt{k/n} \leq 1/(4r_2)$, using $n \geq \tilde\xi k\log p = 16r_2^2 k\tau_1^2$. Consequently the denominator in the right side of (61) is at least $1 - \sqrt{r_1}/(4r_2)$. The latter is at least $3/4$ using $r_2 \geq \sqrt{r_1}$. Thus,

$$\|\hat\beta_S - \beta_S\| \leq \tfrac{2}{3}\,r\,\sigma\sqrt{k}\mu_n = \sqrt{C}\,\sigma\sqrt{k}\mu_n, \qquad (62)$$

where $C = (4/9)r^2$. Correspondingly, from (60) one gets that

$$\|\hat\beta - \beta\|^2 \leq C\sigma^2 k\mu_n^2 + \sum_{j \in S^c} \min\{\beta_j^2, \sigma^2\mu_n^2\} \leq C\sum_{j=1}^{p} \min\{\beta_j^2, \sigma^2\mu_n^2\},$$

where the last inequality follows from using $\sigma^2 k\mu_n^2 = \sum_{j \in S} \min\{\beta_j^2, \sigma^2\mu_n^2\}$, since $S = \{j : |\beta_j| > \sigma\mu_n\}$. □



Proof of Corollary 2.7. For $k$-sparse $\beta$, once again let $S = \{j : |\beta_j| > \sigma\mu_n\}$. Now $\|\beta_{S^c}\|_1 \leq \eta\sigma\mu_n$, where $\eta = k$, since there are at most $k$ non-zero entries outside of $S$, with magnitude at most $\sigma\mu_n$. Now apply Theorem 2.6 with $\eta = k$ (or $\bar\eta = 1$) to get the desired result. □



5 Proof of results from Section 4

The following simple lemma will prove useful in proving Lemma 4.1.

Lemma 5.1. Let $\theta_n = k^{1/2}\mu_n$. Conditions (19)-(21) imply the following.

(i) Let $d$ be as in (53). Then $d \leq \sigma^2(1 + \nu_1\bar\eta\,\theta_n^2)$.

(ii) $\|\Sigma_{SS}\,g\|^2 \leq \sigma^2 s_{\max}^2\tilde\nu_1^2\theta_n^2$, where $g = \Sigma_{SS}^{-1}\Sigma_{SS^c}\beta_{S^c}$.

Remark: Since we take $n \geq 2k\log p$, we have $\theta_n \leq 1$. Accordingly, the above bounds hold with $\theta_n$ replaced by 1.



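Proof of Lemma 5.1. We first prove part (i). Recall that $d = \sigma^2 + \beta_{S^c}^T\Sigma_{S^c|S}\,\beta_{S^c}$. Write $\beta_{S^c}^T\Sigma_{S^c|S}\,\beta_{S^c}$ as $\sum_{j \in S^c} \beta_j e_j^T\Sigma_{S^c|S}\,\beta_{S^c}$, which can be bounded by $(\|\Sigma_{S^c|S}\,\beta_{S^c}\|_\infty)\|\beta_{S^c}\|_1$, which is at most $\sigma^2\nu_1\bar\eta\,\theta_n^2$ from (21) and (10). This completes the proof.

For part (ii) use the fact that $\|\Sigma_{SS}\,g\|^2 \leq s_{\max}^2\|g\|^2$ from (19) and $\|g\| \leq \sigma\sqrt{k}\tilde\nu_1\mu_n$ from (21), to complete the proof. □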



Proof of Lemma 4.1. We use a result in Szarek [21] that gives tail bounds for the largest and smallest singular values of Gaussian random matrices. Let $U \in \mathbb{R}^{n \times k}$ be a matrix with i.i.d. standard Gaussian entries. Then, for $r > 0$, one has

$$P\left(\lambda_k(U/\sqrt{n}) > 1 + \sqrt{k/n} + r\right) \leq e^{-nr^2/2},$$

$$P\left(\lambda_1(U/\sqrt{n}) < 1 - \sqrt{k/n} - r\right) \leq e^{-nr^2/2},$$

where $\lambda_k(\cdot)$ and $\lambda_1(\cdot)$ give the largest and smallest singular values respectively, of an $n \times k$ matrix. Now, taking $r = \mu_n$, one has, using the above, that with probability at least $1 - 2/p$ the following holds:

$$h_\ell\|v\|^2 \leq \frac{1}{n}\|Uv\|^2 \leq h_u\|v\|^2 \quad\text{for all } v \in \mathbb{R}^k.$$

Now, notice that since $X_S = U\Sigma_{SS}^{1/2}$, one has from the above that, with probability at least $1 - 2/p$,

$$h_\ell\|\Sigma_{SS}^{1/2}v\|^2 \leq \frac{1}{n}\|X_S v\|^2 \leq h_u\|\Sigma_{SS}^{1/2}v\|^2 \quad\text{for all } v \in \mathbb{R}^k.$$

Correspondingly, from (19), since $s_{\min} \leq \|\Sigma_{SS}^{1/2}v\|^2/\|v\|^2 \leq s_{\max}$, this implies that, with probability at least $1 - 2/p$,

$$\lambda_{\min}\|v\|^2 \leq \frac{1}{n}\|X_S v\|^2 \leq \lambda_{\max}\|v\|^2 \quad\text{for all } v \in \mathbb{R}^k,$$

where $\lambda_{\min}$, $\lambda_{\max}$ are as in (29).



Before proving parts (ii) and (iii), observe that by conditioning on $X_S$, the distribution of $\tilde\epsilon$ may be expressed as

$$\tilde\epsilon = X_S\,g + \sqrt{d}\,W, \qquad (63)$$

where $g = \Sigma_{SS}^{-1}\Sigma_{SS^c}\beta_{S^c}$ and $d$ is as in (53). Here $W \sim N(0, I_n)$ and is independent of $X_S$.

For part (ii), notice that from the above $\sigma_{\tilde\epsilon}^2 := \mathrm{Var}(\tilde\epsilon_1) = \|\Sigma_{SS}\,g\|^2 + d$, which is at most $\sigma^2(1 + s_{\max}^2\tilde\nu_1^2 + \nu_1\bar\eta)$ from Lemma 5.1. Further, $\|\tilde\epsilon\|^2/\sigma_{\tilde\epsilon}^2 \sim \chi_n^2$. Now from Lemma A.2 the probability of the event $\|\tilde\epsilon\|^2/(n\sigma_{\tilde\epsilon}^2) > (1 + \mu_n)^2$ is bounded by $1/p$. Use $\mu_n \leq k^{-1/2}$ and $\sigma_{\tilde\epsilon}^2 \leq \sigma^2(1 + s_{\max}^2\tilde\nu_1^2 + \nu_1\bar\eta)$, to get that $P(\|\tilde\epsilon\|^2/(n\sigma^2) > \lambda) \leq 1/p$, where $\lambda$ is as in (29).



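For part (iii), notice that $\hat\beta_{ls} - \beta_S = (X_S^T X_S)^{-1}X_S^T\tilde\epsilon$, which using (63), can be expressed as

$$\hat\beta_{ls} - \beta_S = g + \sqrt{d}\,(X_S^T X_S)^{-1}X_S^T W. \qquad (64)$$

Let $\mathcal{E}_{ls}' = \{\sqrt{d}\,\|(X_S^T X_S)^{-1}X_S^T W\|_\infty > \sigma\sqrt{1 + \nu_1\bar\eta}\,\tau/\sqrt{\lambda_{\min}n}\}$. Now, since $W$ is independent of $X_S$, and $d \leq \sigma^2(1 + \nu_1\bar\eta)$, one can use the same logic as in the proof of Lemma 3.2 to get that $P(\mathcal{E}_{cond}^c \cap \mathcal{E}_{ls}') \leq \sqrt{2/\pi}\,k/(\tau p^{1+a})$. Further, $\|g\|_\infty \leq \sigma\tilde\nu_1\mu_n$ using (21), which, using $\mu_n \leq \tau/\sqrt{n}$, is at most $\sigma\tilde\nu_1\tau/\sqrt{n}$. Accordingly, on $\mathcal{E}_{cond}^c \cap \mathcal{E}_{ls}'^c$, one has

$$\|\hat\beta_{ls} - \beta_S\|_\infty \leq \sigma\left[\tilde\nu_1 + \sqrt{\frac{1 + \nu_1\bar\eta}{\lambda_{\min}}}\right]\frac{\tau}{\sqrt{n}} = \sigma\frac{c_0}{\sqrt{n}}\cdot\frac{\tau}{1 - \omega},$$

where $c_0$ is as in (51). Now use $\tau/(1 - \omega) \leq \tau_1$, to get that $P(\mathcal{E}_{cond}^c \cap \mathcal{E}_{ls}) \leq \sqrt{2/\pi}\,k/(\tau p^{1+a})$. This completes the proof of the lemma. □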



Proof of Lemma 4.2. We first prove part (i). Recall, from (63), one has $\tilde\epsilon = X_S\,g + \sqrt{d}\,W$, where $g = \Sigma_{SS}^{-1}\Sigma_{SS^c}\beta_{S^c}$ and $d$ is as in (53). Further, $W$ is independent of $X_S$ and follows $N(0, I_n)$. Correspondingly, the conditional distribution of $X_j$ given $[X_S : W]$ may be expressed as

$$X_j = X_S a_j + b_j W + Z_j,$$

where $a_j = \mathrm{Cov}(X_{1,S}, X_{1j})[\mathrm{Var}(X_{1,S})]^{-1}$ and $b_j = \mathrm{Cov}(X_{1j}, W_1)$. Further, $Z_j \sim N(0, \sigma_{jj}I_n)$ and is independent of $X_S$ and $W$, with

$$\sigma_{jj} = \Sigma_{jj} - \Sigma_{jS}\Sigma_{SS}^{-1}\Sigma_{Sj} - b_j^2,$$

which is at most 1. Clearly, the expression for $a_j$ matches that given in the statement of the lemma. Further, from (63), one has that

$$\mathrm{Cov}(X_{1j}, W_1) = \frac{1}{\sqrt{d}}\left[\mathrm{Cov}(X_{1j}, \tilde\epsilon_1) - \mathrm{Cov}(X_{1j}, X_{1,S}\,g)\right].$$

Notice that $\mathrm{Cov}(X_{1j}, \tilde\epsilon_1) = \Sigma_{jS^c}\beta_{S^c}$ and $\mathrm{Cov}(X_{1j}, X_{1,S}\,g) = \mathrm{Cov}(X_{1j}, X_{1,S})\,g$, which is $\Sigma_{jS}\Sigma_{SS}^{-1}\Sigma_{SS^c}\beta_{S^c}$. Correspondingly, the numerator of the above is $e_j^T\Sigma_{S^c|S}\,\beta_{S^c}$, and hence, the expression for $b_j$ given above matches that in (52).



We now prove part (ii) of Lemma 4.2. Firstly, notice that $\max_{j \in S^c} |b_j| \leq \nu_1\mu_n$. This follows from observing that $d \geq \sigma^2$, from (53), and also the fact that $|e_j^T\Sigma_{S^c|S}\,\beta_{S^c}| \leq \sigma\nu_1\mu_n$, for all $j \in S^c$, from (21).

Recall the statistic $V_{ji}$ given by (55). One sees that

$$|V_{ji}| \leq |b_j|\,\|W\| + |E_{ji}|. \qquad (65)$$

Now $\|W\|^2 \sim \chi_n^2$. Correspondingly, from Lemma A.2 the event $\{\|W\|/\sqrt{n} > (1 + \mu_n)\}$ has probability at most $1/p$.

Further, the $Z_j$'s are independent of $[X_S : \tilde\epsilon]$ and, hence, are also independent of $R_0, \ldots, R_m$, since these residuals are functions of $[X_S : \tilde\epsilon]$. Consequently, the $E_{ji}$'s are standard normal random variables; indeed, conditional on the $R_i$'s, they follow $N(0, 1)$, and hence follow the same distribution unconditionally. Accordingly, using the same logic as in the proof of Theorem 2.1, the event

$$\left\{\max_{1 \leq i \leq m+1,\ j \in S^c} |E_{ji}| > \tau\right\} \qquad (66)$$

has probability bounded by $\sqrt{2/\pi}\,(k + 1)/(\tau p^a)$.

Consequently, using the bounds on $|b_j|$ and the above, one gets that except on a set with probability $1/p + \sqrt{2/\pi}\,(k + 1)/(\tau p^a)$, one has

$$\max_{1 \leq i \leq m+1,\ j \in S^c} |V_{ji}| \leq \nu_1\mu_n\sqrt{n}\,(1 + \mu_n) + \tau.$$

Using $\tau \geq \mu_n\sqrt{n}$ and $\mu_n \leq k^{-1/2}$, the right side of the above is at most $(1 - \omega)\tau_1$. This completes the proof of the lemma. □



6 Conclusion 

The paper analyzed variable selection for the OMP for random $X$ matrices. We analyzed performance with i.i.d. sub-Gaussian designs, which has uses in compressed sensing. We remark that for these i.i.d. designs, the analysis carries over for the hard thresholded version of the algorithm, in which, instead of choosing the $j$ which maximizes the $|Z_{ij}|$'s, one chooses all $j$ satisfying $|Z_{ij}| > \tau$. It is only when there is some correlation within the rows that we find it advantageous to choose the index which maximizes $|Z_{ij}|$.
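The difference between the two selection rules is easiest to see side by side. The following sketch is illustrative only: both function names are hypothetical, and `stats` stands for the vector of statistics $Z_{ij}$ at a single step $i$.

```python
def greedy_selection(stats, tau):
    """OMP rule: the single index maximizing |Z_ij|, if it exceeds tau."""
    if not stats:
        return []
    j_best = max(range(len(stats)), key=lambda j: abs(stats[j]))
    return [j_best] if abs(stats[j_best]) > tau else []


def hard_threshold_selection(stats, tau):
    """Hard thresholded rule: every index with |Z_ij| > tau."""
    return [j for j, z in enumerate(stats) if abs(z) > tau]


if __name__ == "__main__":
    stats = [0.5, -2.0, 1.5]
    print(greedy_selection(stats, tau=1.0))
    print(hard_threshold_selection(stats, tau=1.0))
```

The greedy rule adds at most one index per step, so a column that is merely correlated with a strong true column gets re-evaluated against the updated residual before it can be selected. The hard thresholded rule admits every index above the threshold at once.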

For Gaussian designs, with correlation within rows, we give much more general results. Apart from showing that results similar to those known for the Lasso for exact support recovery are also possible using the OMP, we show additional recovery properties by relaxing the assumption of exact sparsity to a more realistic assumption of a control over the $\ell_1$-norm of the smaller coefficients. Oracle inequalities for the coefficient estimate also followed easily as a consequence of these results.



As mentioned earlier, one drawback of the analysis is the crude manner in which the probability of event (66), that no terms outside of $S$ are selected, is bounded. This gives rise to the $\sqrt{2/\pi}\,(k + 1)/(\tau p^a)$ term in the expression for $p_{err,k}$ (31), because of which $a$ has to be greater than 1 when $k$ is not negligible compared to $p$. In [14], a more careful analysis had been carried out for exact recovery with i.i.d. designs and $\ell_0$-sparse vectors. We believe that their analysis should carry over for the general case analyzed here, by noting that the random variables $E_{ji}$, for $i = 1, \ldots, m + 1$, defined in Lemma 4.2 have the same covariance structure as a normalized Brownian motion at times $t_1, \ldots, t_{m+1}$, where $t_i = \|R_{i-1}\|^2$. This should improve the probability of the event (66) to something closer to $1/p^a$.



For random designs, we measure the performance after averaging over the distribution of $X$. As mentioned before, this can be contrasted to another method, as done in Candes and Plan [7] for the Lasso, in which a distribution is assigned to $\beta$ and the performance is measured after averaging over this distribution. Although these two methods do not imply each other, it is interesting to compare the average performance using both methods. To be consistent with their notation, let's assume that the entries of $X$ are scaled so that the columns have norm equal (or nearly equal) to one. Under a mild assumption on the incoherence, it is shown that for $\ell_0$-sparse vectors the support can be recovered, if

$$k = O\left(p/[\|X\|^2\log p]\right), \qquad (67)$$

where $\|X\|$ denotes the spectral norm of $X$. If $X$ has i.i.d. $N(0, 1/n)$ entries, then $\|X\| \approx \sqrt{p/n}$, so that the sparsity requirement (67) would translate to $k = O(n/\log p)$, which is of the same order as what we get here. However, the situation is different in the general case when the rows are i.i.d. $N(0, \Sigma/n)$. Then $X$ may be expressed as $\tilde X\Sigma^{1/2}$, where $\tilde X$ has i.i.d. $N(0, 1/n)$ entries. Consider the example where $\Sigma_{ii} = 1$ and $\Sigma_{ij} = c/k$, when $i \neq j$, with $c$ appropriately chosen. In this case $\|X\| \approx c'p/\sqrt{nk}$. Consequently, (67) translates to assuming $n = \Omega(p\log p)$. Our results are better in this case, since we only require $\Omega(k\log p)$ observations even for such correlated designs.

An advantage of the work in [7] is its applicability to broad classes of deterministic designs. It is unclear at 
this stage whether such results also hold for the OMP. 



A Tail bounds 

A random variable Z is said to be sub-gaussian with mean and scale a > 0, if Ee tz < e* " I 2 for each 
t G K. 

Lemma A.1. Let $W = (W_j : 1 \le j \le n)^T$, with each $W_j$ sub-gaussian with mean 0 and scale $\sigma_j > 0$. Let
$\sigma = \max_j \sigma_j$. The following hold.

(a) Let $h \in \mathbb{R}^n$, with $\|h\| \le 1$. If the entries of $W$ are independent then $h^T W$ is sub-gaussian with mean 0
and scale $\sigma$.
(b) Let $\rho = \sigma\sqrt{2(1 + a)\log p}$ with $a > 0$. Then $P(\max_j |W_j| > \rho) \le 2n/p^{1+a}$. Further, if the $W_j \sim N(0, \sigma^2)$
then this probability can be bounded by $\sqrt{2/\pi}\,(\sigma n)/(\rho\, p^{1+a})$.

Proof. For part (a), we need to show that $E\exp\{t\, h^T W\} \le \exp\{t^2\sigma^2/2\}$. To see this, notice that
$E\exp\{t\, h^T W\} \le \exp\{t^2 \sum_{j=1}^n h_j^2\sigma_j^2/2\}$, using independence of the $W_j$'s. The claim is proved by noticing that
$\sum_{j=1}^n h_j^2\sigma_j^2 \le \sigma^2$, using $\|h\| \le 1$ and $\sigma_j \le \sigma$.

For part (b), use a Chernoff bound, followed by optimizing the exponent to get that,

$$P(|W_j| > \rho) \le 2\exp\big(-\rho^2/(2\sigma^2)\big).$$

If the $W_j$'s were normal, standard tail bounds [13] reveal that the above bound can be improved to
$(2/(\sqrt{2\pi}\,\rho/\sigma))\exp(-\rho^2/(2\sigma^2))$. Now use a union bound, along with the fact that $\exp(-\rho^2/(2\sigma^2)) = 1/p^{1+a}$, to prove
the claim. □
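
The two bounds in part (b) can be compared with a small simulation. The sketch below is illustrative only: it assumes i.i.d. $N(0, \sigma^2)$ entries, and the function names, sample sizes, and seed are hypothetical choices rather than anything used in the paper.

```python
import math
import random


def subgaussian_max_bound(n, p, a):
    """Bound 2n / p^(1+a) from Lemma A.1(b)."""
    return 2.0 * n / p ** (1.0 + a)


def gaussian_max_bound(n, p, a, sigma=1.0):
    """Sharper bound sqrt(2/pi) * sigma * n / (rho * p^(1+a)) for N(0, sigma^2)."""
    rho = sigma * math.sqrt(2.0 * (1.0 + a) * math.log(p))
    return math.sqrt(2.0 / math.pi) * sigma * n / (rho * p ** (1.0 + a))


def empirical_exceedance(n, p, a, sigma=1.0, trials=5000, seed=0):
    """Monte Carlo estimate of P(max_j |W_j| > rho) for i.i.d. N(0, sigma^2)."""
    rng = random.Random(seed)
    rho = sigma * math.sqrt(2.0 * (1.0 + a) * math.log(p))
    hits = 0
    for _ in range(trials):
        if max(abs(rng.gauss(0.0, sigma)) for _ in range(n)) > rho:
            hits += 1
    return hits / trials


# Hypothetical sizes; a = 0.5 is an arbitrary choice.
n, p, a = 20, 200, 0.5
estimate = empirical_exceedance(n, p, a)
print("empirical frequency :", estimate)
print("Gaussian bound      :", gaussian_max_bound(n, p, a))
print("sub-gaussian bound  :", subgaussian_max_bound(n, p, a))
```

The Gaussian bound is smaller than the sub-gaussian one by the factor $\sqrt{2/\pi}\,\sigma/(2\rho)$, which is why it is the tighter of the two whenever $\rho/\sigma$ is moderately large.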

Next we give a simple lemma on chi-square tail bounds, which will be used repeatedly. 
Lemma A.2. Let $W$ follow $N(0, I_n)$. Then

$$P\big(\|W\|/\sqrt{n} > 1 + \rho_n\big) \le 1/p, \tag{68}$$

where $\rho_n = \sqrt{(2\log p)/n}$.

Proof. Use the fact (see for example [11]) that for $h > 0$, one has

$$P\big(\|W\|/\sqrt{n} > 1 + h\big) \le e^{-nh^2/2}.$$
Substitute $h = \sqrt{(2\log p)/n}$ to get the result. □
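
The substitution in the proof can be checked numerically, together with a quick simulation of the tail probability. This is a minimal sketch; the helper names, the values of $n$ and $p$, and the seed are assumptions made for illustration.

```python
import math
import random


def rho_n(n, p):
    """rho_n = sqrt(2 log(p) / n), as in Lemma A.2."""
    return math.sqrt(2.0 * math.log(p) / n)


def tail_bound(n, h):
    """Bound exp(-n h^2 / 2) on P(||W|| / sqrt(n) > 1 + h)."""
    return math.exp(-n * h * h / 2.0)


def empirical_tail(n, p, trials=2000, seed=1):
    """Monte Carlo estimate of P(||W|| / sqrt(n) > 1 + rho_n) for W ~ N(0, I_n)."""
    rng = random.Random(seed)
    threshold = 1.0 + rho_n(n, p)
    hits = 0
    for _ in range(trials):
        norm = math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)))
        if norm / math.sqrt(n) > threshold:
            hits += 1
    return hits / trials


n, p = 50, 20  # hypothetical sizes
print("bound at h = rho_n :", tail_bound(n, rho_n(n, p)), "versus 1/p =", 1.0 / p)
print("empirical frequency:", empirical_tail(n, p))
```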



B Proof of Lemma 3.1 



For convenience, let $S = \{1, \ldots, k\}$. Let $H_j$, $1 \le j \le k$, denote the columns of the $H$ matrix. Assume that
the algorithm runs for $m$ steps and let $R_1, \ldots, R_{m-1}$ denote the associated residuals. Let $R_0 = Y$. Denote
as $U_A$ the least squares fit when $U$ is regressed on $H_A$. We also denote as $u(i) = S - d(i)$, which corresponds
to the terms in $S$ undetected after step $i$. We assume $u(0) = S$ and $U_{d(0)} = 0$.
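
For orientation, the greedy loop that produces the residuals $R_0, R_1, \ldots$ can be sketched as follows. This is only a minimal illustration under stated assumptions: the stopping rule is taken to be $\max_j |H_j^T R|/\|R\| < \tau_0$, in line with the description later in this appendix; the threshold is taken as $\tau_0 = \sqrt{2(1+a)\log p}$; the least squares fit is computed by modified Gram-Schmidt in pure Python; and all function names, sizes, and the seed are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual(y, columns):
    """Residual of y after least squares projection onto span(columns).

    Uses modified Gram-Schmidt; assumes the columns are linearly independent.
    """
    basis = []
    for col in columns:
        q = list(col)
        for b in basis:
            coef = dot(b, q)
            q = [qi - coef * bi for qi, bi in zip(q, b)]
        norm = math.sqrt(dot(q, q))
        basis.append([qi / norm for qi in q])
    r = list(y)
    for b in basis:
        coef = dot(b, r)
        r = [ri - coef * bi for ri, bi in zip(r, b)]
    return r


def omp(columns, y, tau0, max_steps):
    """Return the indices selected, in order, by a threshold-stopped OMP."""
    selected = []
    for _ in range(max_steps):
        r = residual(y, [columns[j] for j in selected])
        r_norm = math.sqrt(dot(r, r))
        scores = {j: abs(dot(columns[j], r)) / r_norm
                  for j in range(len(columns)) if j not in selected} if r_norm > 0 else {}
        if not scores:
            break
        best = max(scores, key=scores.get)
        if scores[best] < tau0:  # stopping rule (assumed form)
            break
        selected.append(best)
    return selected


# Synthetic example: three active columns with coefficient 3, unit noise.
rng = random.Random(3)
n, p, support = 400, 50, [0, 1, 2]
columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
y = [sum(3.0 * columns[j][i] for j in support) + rng.gauss(0.0, 1.0)
     for i in range(n)]
tau0 = math.sqrt(2.0 * (1.0 + 2.0) * math.log(p))  # a = 2 is an arbitrary choice
selected = omp(columns, y, tau0, max_steps=10)
print("selected columns:", selected)
```

In the notation above, `selected` after step $i$ plays the role of $d(i)$ when restricted to $S$, and the active indices not yet in `selected` form $u(i)$.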

The following lemma is from Zhang [28].

Lemma B.1. (Zhang [28]) For each $i$, with $0 \le i < m$, if $|u(i)| > 0$, then

$$\|U_{d(i)} - U_S\| \le \sqrt{\frac{|u(i)|}{\lambda_{\min}}}\; \max_{j \in u(i)} \frac{|H_j^T R_i|}{\|H_j\|}.$$

The result is a consequence of Lemmas 6 and 7 in Zhang [28, page 566]. Using his notation, in our case,
$\lambda_{\min} = \rho(F)$, $R_i = Y - X\beta^{(i-1)}$, $U_{d(i)} = X\beta^{(i-1)}$, $U_S = X\hat{\beta}_X(F, y)$ and $u(i) = F - F^{(i-1)}$.

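Lemma B.2. For each $i$, with $0 \le i < m$, one has

$$\|R_i\|/\sqrt{n} \le \sqrt{\tilde{\lambda}_{\max}}\,\big(\|\varphi_{u(i)}\| + \sigma\big),$$

where $\tilde{\lambda}_{\max} = \max\{\lambda, \lambda_{\max}\}$.

Proof of B.2. Write $R_i = (I - \mathcal{P}_i)U$, where here $\mathcal{P}_i$ is the projection matrix for the column space of $H_{d(i)}$. Now
$U = H_{d(i)}\varphi_{d(i)} + H_{u(i)}\varphi_{u(i)} + \epsilon$ and $(I - \mathcal{P}_i)H_{d(i)} = 0$. Correspondingly, $R_i = (I - \mathcal{P}_i)[H_{u(i)}\varphi_{u(i)} + \epsilon]$.
Consequently, $\|R_i\| \le \|H_{u(i)}\varphi_{u(i)}\| + \|\epsilon\|$, since $\|(I - \mathcal{P}_i)x\| \le \|x\|$ for any $x \in \mathbb{R}^n$. The result immediately
follows from using $\|H_{u(i)}\varphi_{u(i)}\|/\sqrt{n} \le \sqrt{\tilde{\lambda}_{\max}}\,\|\varphi_{u(i)}\|$ and $\|\epsilon\|/(\sqrt{n}\,\sigma) \le \sqrt{\lambda}$. This completes the proof of the
lemma. □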
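
The two facts used in this proof, namely that $(I - \mathcal{P}_i)$ annihilates the detected columns and does not increase norms, can be checked numerically. The sketch below simplifies to a single detected column and a single undetected column; this simplification, the coefficient values, and all names are assumptions made for illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def project_out(x, h):
    """Return (I - P) x, where P projects onto the span of the single column h."""
    coef = dot(h, x) / dot(h, h)
    return [xi - coef * hi for xi, hi in zip(x, h)]


rng = random.Random(2)
n = 100
h_detected = [rng.gauss(0.0, 1.0) for _ in range(n)]    # plays the role of H_{d(i)}
h_undetected = [rng.gauss(0.0, 1.0) for _ in range(n)]  # plays the role of H_{u(i)}
phi_d, phi_u, sigma = 2.0, 0.7, 1.0                     # hypothetical values
eps = [rng.gauss(0.0, sigma) for _ in range(n)]

# U = H_d * phi_d + H_u * phi_u + eps
u_vec = [phi_d * a + phi_u * b + e
         for a, b, e in zip(h_detected, h_undetected, eps)]
signal_u = [phi_u * b for b in h_undetected]

# Residual computed from U, and from H_u * phi_u + eps only.
r = project_out(u_vec, h_detected)
r_alt = project_out([s + e for s, e in zip(signal_u, eps)], h_detected)

print("largest gap between the two residuals:",
      max(abs(x - y) for x, y in zip(r, r_alt)))
print("||R|| =", norm(r), " bound =", norm(signal_u) + norm(eps))
```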



Now use the fact that $\|H_j\| \ge \sqrt{n}\sqrt{\lambda_{\min}}$, to get from Lemma B.1 that,

$$\max_{j \in u(i)} |H_j^T R_i| \ge \sqrt{\frac{n\rho_1}{|u(i)|}}\; \|U_{d(i)} - U_S\|,$$

where $\rho_1 = \lambda_{\min}^2$. Consequently, using Lemma B.2 and the above, one has that,

$$\max_{j \in u(i)} \frac{|H_j^T R_i|}{\|R_i\|} \ge \sqrt{\frac{n\rho_2}{|u(i)|}}\; \frac{\|U_{d(i)} - U_S\|/\sqrt{n}}{\|\varphi_{u(i)}\| + \sigma},$$

where $\rho_2 = \rho_1/\tilde{\lambda}_{\max}$. The algorithm continues as long as the left side of the above is at least $\tau_0$. Consequently,
following the reasoning in [23], when the algorithm stops, one must have that either $|F_2| = 0$ or the right
side of the above, with $u(i)$ replaced by $F_2$, is at most $\tau_0$. Let's assume that $|F_2| > 0$, since otherwise we
would have correctly decoded all terms. Correspondingly, we have,

$$\|\hat{U} - U_S\|/\sqrt{n} \le \tau_0\sqrt{\frac{|F_2|}{n\rho_2}}\,\big(\|\varphi_{F_2}\| + \sigma\big) \tag{69}$$

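when the algorithm stops. Now,

$$\|\varphi_{F_2}\| \le \sqrt{|F_2|}\,\|\varphi - \hat{\varphi}_{ls}\|_\infty + \|\hat{\varphi}_{ls} - \hat{\varphi}\|. \tag{70}$$

To see this note that $\|\varphi_{F_2}\|$ is bounded by the sum of $\|\varphi_{F_2} - \hat{\varphi}_{ls,F_2}\|$ and $\|\hat{\varphi}_{ls,F_2}\|$, where $\hat{\varphi}_{ls,F_2}$ is the
sub-vector of $\hat{\varphi}_{ls}$ with indices in $F_2$. The first term in the bound is at most $\sqrt{|F_2|}\,\|\varphi - \hat{\varphi}_{ls}\|_\infty$, whereas the
second term can be bounded by $\|\hat{\varphi}_{ls} - \hat{\varphi}\|$, since $\hat{\varphi}_j$ is zero for all indices $j$ in $F_2$. Now, use the fact that
$\|\hat{\varphi}_{ls} - \varphi\|_\infty$ is bounded by $c_0\sigma\tau_0/\sqrt{n}$ along with the fact that $\|\hat{U} - U_S\|/\sqrt{n} \ge \sqrt{\lambda_{\min}}\,\|\hat{\varphi} - \hat{\varphi}_{ls}\|$, to get
from (69) and (70) that,

$$\|\varphi_{F_2}\| \le c_0\sigma\tau_0\sqrt{\frac{|F_2|}{n}} + \tau_0\sqrt{\frac{r_1|F_2|}{n}}\,\big(\|\varphi_{F_2}\| + \sigma\big) \tag{71}$$
when the algorithm stops. Here we use that $r_1 = 1/(\lambda_{\min}\rho_2)$. One gets from (71) that

$$\Big(1 - \tau_0\sqrt{r_1|F_2|/n}\Big)\,\|\varphi_{F_2}\| \le r_2\,\sigma\tau_0\sqrt{|F_2|/n}, \tag{72}$$

where $r_2 = c_0 + \sqrt{r_1}$, with $r_1$ as above. Using $|F_2| \le k$, the term $\tau_0\sqrt{r_1|F_2|/n}$ appearing in the left side of the
above can be bounded by $\tau_0\sqrt{r_1 k/n}$. This leads us to (44), which completes the proof of part (a).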
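For part (b), notice that

$$\|\hat{\varphi} - \varphi\| \le \sqrt{k}\,\|\hat{\varphi}_{ls} - \varphi\|_\infty + \|\hat{\varphi}_{ls} - \hat{\varphi}\|. \tag{73}$$

Now use,

$$\|\hat{\varphi}_{ls} - \hat{\varphi}\| \le \tau_0\sqrt{r_1 k/n}\,\big(\|\varphi_{F_2}\| + \sigma\big)$$

along with,

$$\|\varphi_{F_2}\| \le \frac{r_2\,\sigma\tau_0\sqrt{k/n}}{1 - \tau_0\sqrt{r_1 k/n}} \tag{74}$$

to get, after rearranging, that,

$$\|\hat{\varphi}_{ls} - \hat{\varphi}\| \le \sigma\tau_0\sqrt{r_1 k/n}\;\frac{c_0\tau_0\sqrt{k/n} + 1}{1 - \tau_0\sqrt{r_1 k/n}}.$$

Now use $\sqrt{k}\,\|\hat{\varphi}_{ls} - \varphi\|_\infty \le \sigma c_0\tau_0\sqrt{k/n}$, along with $r_2 = c_0 + \sqrt{r_1}$, to get from (73) and the above that,

$$\|\hat{\varphi} - \varphi\| \le \frac{r_2\,\sigma\tau_0\sqrt{k/n}}{1 - \tau_0\sqrt{r_1 k/n}}.$$

This completes the proof of the lemma.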
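
The rearrangement in part (b) is a short piece of algebra that is easy to get wrong, so a numerical check may be useful. The sketch below adds the two terms on the right side of (73), using the bounds quoted above, and compares the sum with the closed form $r_2\sigma\tau_0\sqrt{k/n}/(1 - \tau_0\sqrt{r_1 k/n})$. The function name and the parameter values are hypothetical.

```python
import math


def part_b_bounds(c0, r1, tau0, sigma, k, n):
    """Return (sum of the two bounds in (73), closed-form bound of part (b))."""
    x = tau0 * math.sqrt(r1 * k / n)
    if not 0.0 <= x < 1.0:
        raise ValueError("need tau0 * sqrt(r1 * k / n) < 1")
    r2 = c0 + math.sqrt(r1)
    s = math.sqrt(k / n)
    # Bound on sqrt(k) * ||phi_ls_hat - phi||_inf.
    first = sigma * c0 * tau0 * s
    # Bound on ||phi_ls_hat - phi_hat|| after substituting (74).
    second = sigma * x * (c0 * tau0 * s + 1.0) / (1.0 - x)
    closed = r2 * sigma * tau0 * s / (1.0 - x)
    return first + second, closed


# Hypothetical values, chosen so that tau0 * sqrt(r1 * k / n) < 1.
total, closed = part_b_bounds(c0=1.5, r1=2.0, tau0=3.0, sigma=1.0, k=5, n=2000)
print("sum of the two terms:", total)
print("closed form         :", closed)
```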



C Proof of Lemma 2.3 



For a matrix $A \in \mathbb{R}^{n \times m}$, and $a = 1$ or $\infty$, denote as $|||A|||_a = \sup_v \|Av\|_a/\|v\|_a$. Recall that $|||A|||_1$ is the
maximum of the $\ell_1$ norms of the columns, whereas $|||A|||_\infty$ is the maximum of the $\ell_1$ norms of the rows.
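
The two induced norms are simple to compute, and the inequality $\|Av\|_\infty \le |||A|||_\infty \|v\|_\infty$ is the form in which they are used below. The following sketch uses small hypothetical matrices; the function names are not from the paper.

```python
def induced_one_norm(a):
    """|||A|||_1: maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in a) for j in range(len(a[0])))


def induced_inf_norm(a):
    """|||A|||_inf: maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in a)


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


# Symmetric example: the two norms coincide.
sym = [[1.0, -0.2, 0.1],
       [-0.2, 1.0, 0.3],
       [0.1, 0.3, 1.0]]
# Non-symmetric example: the two norms differ.
tri = [[1.0, 2.0],
       [0.0, 3.0]]
v = [0.5, -1.0, 2.0]

print("symmetric    :", induced_one_norm(sym), induced_inf_norm(sym))
print("non-symmetric:", induced_one_norm(tri), induced_inf_norm(tri))
print("||Av||_inf   :", max(abs(x) for x in matvec(sym, v)),
      " bound:", induced_inf_norm(sym) * max(abs(x) for x in v))
```

The symmetric case is the one relevant to a Gram or covariance block such as $\Sigma_{SS}$, where rows and columns have the same absolute sums.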

We first prove part (i). We use Cai and Wang [5, Lemma 2], to get that

$$1 - \gamma(k - 1) \le s_{\min} \le s_{\max} \le 1 + \gamma(k - 1).$$

Now $\gamma \le \omega_0/(2k)$, since $k \le \bar{k}$, and hence, the left side of the above is at least $1 - \omega_0/2$ and the right side is
at most $1 + \omega_0/2$. Further, use Tropp [33, Theorem 3.5], to get that
The right side of the above is at most $\omega_0$. Correspondingly, we may take $\omega$ as $\omega_0$.
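We next prove part (ii). Use the fact that,

$$\|\Sigma_{SS}^{-1}\Sigma_{SS^c}\beta_{S^c}\|_\infty \le |||\Sigma_{SS}^{-1}|||_\infty\, \|\Sigma_{SS^c}\beta_{S^c}\|_\infty. \tag{75}$$

Now as $\Sigma_{SS}$ is symmetric, $|||\Sigma_{SS}^{-1}|||_\infty = |||\Sigma_{SS}^{-1}|||_1$; the latter is at most $1/(1 - \gamma(k - 1))$ from [33, Theorem 3.5].
Further, $\|\Sigma_{SS^c}\beta_{S^c}\|_\infty \le \gamma\|\beta_{S^c}\|_1$, which is at most $\sigma\gamma\eta\mu_n$. Correspondingly, from (75), one gets

$$\|\Sigma_{SS}^{-1}\Sigma_{SS^c}\beta_{S^c}\|_\infty \le \sigma\,\frac{\gamma}{1 - \gamma(k - 1)}\,\eta\mu_n. \tag{76}$$
The right side of the above is at most $\sigma\omega_0\tilde{\eta}\mu_n$, using the bound on $\gamma$. Further,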

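$$\|\Sigma_{S^c|S}\beta_{S^c}\|_\infty \le \|\Sigma_{S^cS^c}\beta_{S^c}\|_\infty + \|\Sigma_{S^cS}\Sigma_{SS}^{-1}\Sigma_{SS^c}\beta_{S^c}\|_\infty. \tag{77}$$

Now, $\|\Sigma_{S^cS^c}\beta_{S^c}\|_\infty \le \|\beta_{S^c}\|_\infty + \|(\Sigma_{S^cS^c} - I)\beta_{S^c}\|_\infty$. Further, use $\|\beta_{S^c}\|_\infty \le \sigma\nu\mu_n$ and $\|(\Sigma_{S^cS^c} - I)\beta_{S^c}\|_\infty \le \gamma\|\beta_{S^c}\|_1$,
the right side of which is at most $\sigma\gamma\eta\mu_n$. Also, the second term in (77) can be bounded as follows:

$$\|\Sigma_{S^cS}\Sigma_{SS}^{-1}\Sigma_{SS^c}\beta_{S^c}\|_\infty \le |||\Sigma_{S^cS}|||_\infty\, \|\Sigma_{SS}^{-1}\Sigma_{SS^c}\beta_{S^c}\|_\infty.$$

The first term in the right side product is bounded by $\gamma k$, whereas the second term, from (76), is bounded
by $\sigma\omega_0\tilde{\eta}\mu_n$. Correspondingly, one gets that

$$\|\Sigma_{S^c|S}\beta_{S^c}\|_\infty \le \sigma\nu\mu_n + \sigma\gamma\eta\mu_n + \sigma\gamma\omega_0\eta\mu_n.$$

Further, using $\gamma\eta + \gamma\eta\omega_0 \le 2\gamma\eta$, which is at most $\omega_0\tilde{\eta}$, one gets the bound on $\|\Sigma_{S^c|S}\beta_{S^c}\|_\infty$.

For $k = 0$, one has $\|\Sigma_{S^cS^c}\beta_{S^c}\|_\infty \le \sigma(\nu + \gamma\eta)\mu_n$, which is at most $\sigma(\nu + \omega_0\tilde{\eta})\mu_n$, from the bound derived above. This
completes the proof of the lemma.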